Background

The file ‘WeatherEDA_202005_v002.Rmd’ contains exploratory data analysis for historical weather data as contained in METAR archives hosted by Iowa State University.

Data have been downloaded, processed, cleaned, and integrated for several stations (airports) and years, with .rds files saved in “./RInputFiles/ProcessedMETAR”.

This module will perform initial modeling on the processed weather files. It builds on the previous ‘WeatherModeling_202006_v001.Rmd’ and leverages functions in ‘WeatherModelingFunctions_v001.R’.

Data Availability

There are three main processed files available for further exploration:

metar_postEDA_20200617.rds and metar_postEDA_extra_20200627.rds

  • source (chr) - the reporting station and year
  • locale (chr) - the descriptive name for source
  • dtime (dttm) - the date-time for the observation
  • origMETAR (chr) - the original METAR associated with the observation at that source and date-time
  • year (dbl) - the year, extracted from dtime
  • monthint (dbl) - the month, extracted from dtime, as an integer
  • month (fct) - the month, extracted from dtime, as a three-character abbreviation (factor)
  • day (int) - the day of the month, extracted from dtime
  • WindDir (chr) - prevailing wind direction in degrees, stored as a character since ‘VRB’ means variable
  • WindSpeed (int) - the prevailing wind speed in knots
  • WindGust (dbl) - the wind gust speed in knots (NA if there is no recorded wind gust at that hour)
  • predomDir (chr) - the predominant wind direction as NE-E-SE-S-SW-W-NW-N-VRB-000-Error
  • Visibility (dbl) - surface visibility in statute miles
  • Altimeter (dbl) - altimeter in inches of mercury
  • TempF (dbl) - the Fahrenheit temperature
  • DewF (dbl) - the Fahrenheit dew point
  • modSLP (dbl) - Sea-Level Pressure (SLP) in hPa, adjusted to account for SLP being recorded in the 0-1000 range while the true values fall roughly in the 950-1050 hPa range
  • cTypen (chr) - the cloud type of the nth cloud layer (FEW, BKN, SCT, OVC, or VV)
  • cLeveln (dbl) - the cloud height in feet of the nth cloud layer
  • isRain (lgl) - was rain occurring at the moment the METAR was captured?
  • isSnow (lgl) - was snow occurring at the moment the METAR was captured?
  • isThunder (lgl) - was thunder occurring at the moment the METAR was captured?
  • p1Inches (dbl) - how many inches of rain occurred in the past hour?
  • p36Inches (dbl) - how many inches of rain occurred in the past 3/6 hours (3-hour summaries at 3Z-9Z-15Z-21Z and 6-hour summaries at 6Z-12Z-18Z-24Z and NA at any other Z times)?
  • p24Inches (dbl) - how many inches of rain occurred in the past 24 hours (at 12Z, NA at all other times)
  • tempFHi (dbl) - the high temperature in the past 24 hours, in Fahrenheit (reported once per day)
  • tempFLo (dbl) - the low temperature in the past 24 hours, in Fahrenheit (reported once per day)
  • minHeight (dbl) - the minimum cloud height in feet (-100 means ‘no clouds’)
  • minType (fct) - amount of obscuration at the minimum cloud height (VV > OVC > BKN > SCT > FEW > CLR)
  • ceilingHeight (dbl) - the minimum cloud ceiling in feet (-100 means ‘no ceiling’)
  • ceilingType (fct) - amount of obscuration at the minimum ceiling height (VV > OVC > BKN)

metar_modifiedClouds_20200617.rds and metar_modifiedclouds_extra_20200627.rds

  • source (chr) - the reporting station and year
  • sourceName (chr) - the descriptive name for source
  • dtime (dttm) - the date-time for the observation
  • level (dbl) - cloud level (level 0 is inserted for every source-dtime as a base layer of clear)
  • height (dbl) - level height (height -100 is inserted for every source-dtime as a base layer of clear)
  • type (chr) - level type (type CLR is inserted for every source-dtime as a base layer of clear)

metar_precipLists_20200617.rds and metar_precipLists_extra_20200627.rds

  • Contains elements for each of rain/snow/thunder for each of 2015/2016/2017
  • Each element contains a list and a tibble
  • The tibble is precipLength and contains precipitation by month as source-locale-month-hours-events
  • The list is precipList and contains data on each precipitation interval
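The nested layout described above can be navigated directly; a minimal base-R sketch using a toy object that mirrors that layout (the element name rain_2016 and all values are made up for illustration):

```r
# Toy object mirroring the described layout of metar_precipLists_*.rds
# (the element name "rain_2016" and all values are assumptions)
precipLists <- list(
  rain_2016 = list(
    # precipLength: precipitation by month as source-locale-month-hours-events
    precipLength = data.frame(
      source = "kdtw_2016",
      locale = "Detroit, MI (2016)",
      month  = "Jan",
      hours  = 42,
      events = 7
    ),
    # precipList: one entry per precipitation interval
    precipList = list(
      list(start = as.POSIXct("2016-01-08 13:53", tz = "UTC"), lengthHours = 6)
    )
  )
)

# Access the monthly summary for one element
janRain <- precipLists$rain_2016$precipLength

# Average event length in hours for that month
avgLength <- janRain$hours / janRain$events
```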

Several mapping files are defined for use in plotting; tidyverse, lubridate, and caret are loaded; and the relevant functions are sourced:

# The process frequently uses tidyverse, lubridate, caret, and randomForest
library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.4
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
# The main path for the files
filePath <- "./RInputFiles/ProcessedMETAR/"


# Sourcing functions
source("./WeatherModelingFunctions_v001.R")


# Descriptive names for key variables
varMapper <- c(source="Original source file name", 
               locale="Descriptive name",
               dtime="Date-Time (UTC)",
               origMETAR="Original METAR",
               year="Year",
               monthint="Month",
               month="Month", 
               day="Day of Month",
               WindDir="Wind Direction (degrees)", 
               WindSpeed="Wind Speed (kts)",
               WindGust="Wind Gust (kts)",
               predomDir="General Prevailing Wind Direction",
               Visibility="Visibility (SM)", 
               Altimeter="Altimeter (inches Hg)",
               TempF="Temperature (F)",
               DewF="Dew Point (F)", 
               modSLP="Sea-Level Pressure (hPa)", 
               cType1="First Cloud Layer Type", 
               cLevel1="First Cloud Layer Height (ft)",
               isRain="Rain at Observation Time",
               isSnow="Snow at Observation Time",
               isThunder="Thunder at Observation Time",
               tempFHi="24-hour High Temperature (F)",
               tempFLo="24-hour Low Temperature (F)",
               minHeight="Minimum Cloud Height (ft)",
               minType="Obscuration Level at Minimum Cloud Height",
               ceilingHeight="Minimum Ceiling Height (ft)",
               ceilingType="Obscuration Level at Minimum Ceiling Height", 
               hr="Hour of Day (Zulu time)",
               hrfct="Hour of Day (Zulu time)",
               hrBucket="Hour of Day (Zulu time) - rounded to nearest 3",
               locNamefct="Locale Name"
               )


# File name to city name mapper
cityNameMapper <- c(katl_2016="Atlanta, GA (2016)",
                    kbos_2016="Boston, MA (2016)", 
                    kdca_2016="Washington, DC (2016)", 
                    kden_2016="Denver, CO (2016)", 
                    kdfw_2016="Dallas, TX (2016)", 
                    kdtw_2016="Detroit, MI (2016)", 
                    kewr_2016="Newark, NJ (2016)",
                    kgrb_2016="Green Bay, WI (2016)",
                    kgrr_2016="Grand Rapids, MI (2016)",
                    kiah_2016="Houston, TX (2016)",
                    kind_2016="Indianapolis, IN (2016)",
                    klas_2014="Las Vegas, NV (2014)",
                    klas_2015="Las Vegas, NV (2015)",
                    klas_2016="Las Vegas, NV (2016)", 
                    klas_2017="Las Vegas, NV (2017)", 
                    klas_2018="Las Vegas, NV (2018)",
                    klas_2019="Las Vegas, NV (2019)",
                    klax_2016="Los Angeles, CA (2016)", 
                    klnk_2016="Lincoln, NE (2016)",
                    kmia_2016="Miami, FL (2016)", 
                    kmke_2016="Milwaukee, WI (2016)",
                    kmsn_2016="Madison, WI (2016)",
                    kmsp_2016="Minneapolis, MN (2016)",
                    kmsy_2014="New Orleans, LA (2014)",
                    kmsy_2015="New Orleans, LA (2015)",
                    kmsy_2016="New Orleans, LA (2016)", 
                    kmsy_2017="New Orleans, LA (2017)", 
                    kmsy_2018="New Orleans, LA (2018)",
                    kmsy_2019="New Orleans, LA (2019)",
                    kord_2014="Chicago, IL (2014)",
                    kord_2015="Chicago, IL (2015)",
                    kord_2016="Chicago, IL (2016)", 
                    kord_2017="Chicago, IL (2017)", 
                    kord_2018="Chicago, IL (2018)",
                    kord_2019="Chicago, IL (2019)",
                    kphl_2016="Philadelphia, PA (2016)", 
                    kphx_2016="Phoenix, AZ (2016)", 
                    ksan_2014="San Diego, CA (2014)",
                    ksan_2015="San Diego, CA (2015)",
                    ksan_2016="San Diego, CA (2016)",
                    ksan_2017="San Diego, CA (2017)",
                    ksan_2018="San Diego, CA (2018)",
                    ksan_2019="San Diego, CA (2019)",
                    ksat_2016="San Antonio, TX (2016)", 
                    ksea_2016="Seattle, WA (2016)", 
                    ksfo_2016="San Francisco, CA (2016)", 
                    ksjc_2016="San Jose, CA (2016)",
                    kstl_2016="Saint Louis, MO (2016)", 
                    ktpa_2016="Tampa Bay, FL (2016)", 
                    ktvc_2016="Traverse City, MI (2016)"
                    )

# File names in 2016, based on cityNameMapper
names_2016 <- grep(names(cityNameMapper), pattern="[a-z]{3}_2016", value=TRUE)

The main data will be from the metar_postEDA files. They are integrated below, and cloud and ceiling heights are converted to factors:

# Main weather data
metarData <- readRDS("./RInputFiles/ProcessedMETAR/metar_postEDA_20200617.rds") %>%
    bind_rows(readRDS("./RInputFiles/ProcessedMETAR/metar_postEDA_extra_20200627.rds")) %>%
    mutate(orig_minHeight=minHeight, 
           orig_ceilingHeight=ceilingHeight, 
           minHeight=mapCloudHeight(minHeight), 
           ceilingHeight=mapCloudHeight(ceilingHeight)
           )
glimpse(metarData)
## Observations: 437,417
## Variables: 43
## $ source             <chr> "kdtw_2016", "kdtw_2016", "kdtw_2016", "kdtw_201...
## $ locale             <chr> "Detroit, MI (2016)", "Detroit, MI (2016)", "Det...
## $ dtime              <dttm> 2016-01-01 00:53:00, 2016-01-01 01:53:00, 2016-...
## $ origMETAR          <chr> "KDTW 010053Z 23012KT 10SM OVC020 00/M05 A3021 R...
## $ year               <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ monthint           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ month              <fct> Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan...
## $ day                <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ WindDir            <chr> "230", "230", "230", "240", "230", "220", "220",...
## $ WindSpeed          <int> 12, 12, 11, 14, 16, 13, 14, 16, 13, 16, 17, 13, ...
## $ WindGust           <dbl> NA, NA, NA, 23, 22, NA, 20, 20, NA, 22, NA, NA, ...
## $ predomDir          <fct> SW, SW, SW, SW, SW, SW, SW, SW, SW, SW, SW, SW, ...
## $ Visibility         <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 8, 5, 7, 8, 10, ...
## $ Altimeter          <dbl> 30.21, 30.21, 30.19, 30.19, 30.18, 30.16, 30.14,...
## $ TempF              <dbl> 32.00, 32.00, 32.00, 30.92, 30.92, 32.00, 30.92,...
## $ DewF               <dbl> 23.00, 21.92, 21.02, 19.94, 19.94, 19.94, 19.94,...
## $ modSLP             <dbl> 1023.6, 1023.5, 1023.0, 1023.0, 1022.7, 1022.0, ...
## $ cType1             <chr> "OVC", "OVC", "OVC", "OVC", "OVC", "OVC", "OVC",...
## $ cType2             <chr> "", "", "", "", "", "", "", "", "", "OVC", "OVC"...
## $ cType3             <chr> "", "", "", "", "", "", "", "", "", "", "", "", ...
## $ cType4             <chr> "", "", "", "", "", "", "", "", "", "", "", "", ...
## $ cType5             <chr> "", "", "", "", "", "", "", "", "", "", "", "", ...
## $ cType6             <chr> "", "", "", "", "", "", "", "", "", "", "", "", ...
## $ cLevel1            <dbl> 2000, 2000, 2200, 2500, 2700, 2500, 2500, 2500, ...
## $ cLevel2            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 4500, 4000, ...
## $ cLevel3            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ cLevel4            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ cLevel5            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ cLevel6            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ isRain             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ isSnow             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ isThunder          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ p1Inches           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 0, 0, 0, 0, 0, N...
## $ p36Inches          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 0, NA, NA, 0, NA...
## $ p24Inches          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ tempFHi            <dbl> NA, NA, NA, NA, 36, NA, NA, NA, NA, NA, NA, NA, ...
## $ tempFLo            <dbl> NA, NA, NA, NA, 31, NA, NA, NA, NA, NA, NA, NA, ...
## $ minHeight          <fct> Low, Low, Low, Low, Low, Low, Low, Low, Low, Low...
## $ minType            <fct> OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, BKN...
## $ ceilingHeight      <fct> Low, Low, Low, Low, Low, Low, Low, Low, Low, Low...
## $ ceilingType        <fct> OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, BKN...
## $ orig_minHeight     <dbl> 2000, 2000, 2200, 2500, 2700, 2500, 2500, 2500, ...
## $ orig_ceilingHeight <dbl> 2000, 2000, 2200, 2500, 2700, 2500, 2500, 2500, ...
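mapCloudHeight() is defined in ‘WeatherModelingFunctions_v001.R’; a plausible base-R sketch of the idea, assuming -100 is the ‘no clouds’ sentinel and using illustrative height bands (the labels and cut points here are assumptions, not the sourced definition):

```r
# Hypothetical sketch of mapCloudHeight(): bins numeric cloud heights (ft)
# into a factor; -100 is the sentinel for "no clouds"/"no ceiling".
# The cut points (6500 ft, 20000 ft) follow common low/mid/high cloud bands
# but are assumptions, not the values used in WeatherModelingFunctions_v001.R.
mapCloudHeight <- function(height) {
  cut(height,
      breaks = c(-Inf, -100, 6500, 20000, Inf),
      labels = c("Clear", "Low", "Mid", "High"),
      right = TRUE)
}

mapCloudHeight(c(-100, 2000, 12000, 30000))
```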

Initial Exploration

An initial exploration of the data can use hierarchical clustering based on several numerical features of the data (temperature, dew point, sea-level pressure, wind speed) by month.

Example code includes:

# Create hierarchical clustering for 2016 data
# Distance is calculated only where month is the same
distData <- metarData %>% 
    filter(year==2016) %>%
    select(locale, month, TempF, DewF, modSLP, WindSpeed) %>%
    group_by(locale, month) %>% 
    summarize_if(is.numeric, mean, na.rm=TRUE) %>% 
    ungroup() %>% 
    mutate(rowname=paste0(locale, "_", month)) %>% 
    column_to_rownames() %>% 
    select(-locale, -month) %>% 
    as.matrix() %>% 
    scale() %>% 
    dist() %>% 
    as.matrix() %>% 
    as.data.frame() %>%
    rownames_to_column(var="locale1") %>% 
    pivot_longer(-locale1, names_to="locale2", values_to="dist") %>% 
    mutate(city1=str_replace(locale1, "_\\w{3}", ""), 
           city2=str_replace(locale2, "_\\w{3}", ""), 
           month1=str_sub(locale1, -3), 
           month2=str_sub(locale2, -3)) %>%
    filter(month1==month2) %>% 
    group_by(city1, city2) %>% 
    summarize(meandist=mean(dist), sddist=sd(dist)) %>% 
    ungroup() %>% 
    select(-sddist) %>% 
    pivot_wider(city1, names_from="city2", values_from="meandist") %>%
    column_to_rownames(var="city1") %>%
    as.matrix() %>%
    as.dist(diag=FALSE)

distData %>%
    hclust(method="complete") %>%
    plot()

distData %>%
    hclust(method="single") %>%
    plot()
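The dendrograms can also be cut into a fixed number of groups with cutree(); a self-contained illustration on a toy distance matrix (the cities and distances are made up):

```r
# Toy distance matrix over four made-up cities, mirroring the structure
# of distData above (values are illustrative only)
m <- matrix(c(0, 1, 8, 9,
              1, 0, 8, 9,
              8, 8, 0, 2,
              9, 9, 2, 0),
            nrow = 4,
            dimnames = list(c("A", "B", "C", "D"), c("A", "B", "C", "D")))
d <- as.dist(m)

# Cut the complete-linkage tree into 2 clusters
groups <- cutree(hclust(d, method = "complete"), k = 2)
groups
```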

At a glance, there are several sensible findings from the clustering:

  • Two pairs of cities in close geographic proximity (Newark and Philadelphia; Chicago and Milwaukee) show the greatest similarity in the clustering
  • A cluster of coastal California cities (San Diego, Los Angeles, San Jose) shows strong similarities
  • A cluster of hot, humid cities (Houston and New Orleans, joined by Miami and Tampa) shows strong similarities
  • Several clusters of cold-weather cities emerge
  • Two desert cities (Las Vegas, Phoenix) show similarities

There are a handful of cities that do not seem to cluster with anything else:

  • Denver, Boston, Seattle, San Francisco

Temperature and Dew Point

Suppose that the only information available about two cities were their temperatures and dew points, with month also included. How well would a basic random forest, with mtry=3, classify the cities?

The rfTwoLocales() function (sourced from ‘WeatherModelingFunctions_v001.R’) is then run for every pair of 2016 locales in cityNameMapper. A single random seed is set before the loop:

# Create a container list to hold the output
list_2016_TempF_DewF_month <- vector("list", 0.5*length(names_2016)*(length(names_2016)-1))

set.seed(2006281443)

n <- 1
for (ctr in 1:(length(names_2016)-1)) {
    for (ctr2 in (ctr+1):length(names_2016)) {
        list_2016_TempF_DewF_month[[n]] <- rfTwoLocales(metarData, 
                                                        loc1=names_2016[ctr], 
                                                        loc2=names_2016[ctr2], 
                                                        vrbls=c("TempF", "DewF", "month"),
                                                        ntree=25
                                                        )
        n <- n + 1
        if ((n %% 50) == 0) { cat("Through number:", n, "\n")}
    }
}
## Through number: 50 
## Through number: 100 
## Through number: 150 
## Through number: 200 
## Through number: 250 
## Through number: 300 
## Through number: 350 
## Through number: 400
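helperAccuracyLocale() also comes from the sourced functions file; a minimal base-R sketch of the presumed reduction from a two-locale confusion matrix (rows = actual, columns = predicted) to overall and per-locale accuracies (the input shape is an assumption):

```r
# Hypothetical sketch of helperAccuracyLocale(): given a 2x2 confusion
# matrix (rows = actual locale, columns = predicted locale), return a
# one-row data frame of overall and per-locale accuracies.
helperAccuracyLocale <- function(confusion) {
  locs <- rownames(confusion)
  data.frame(
    locale1    = locs[1],
    locale2    = locs[2],
    accOverall = sum(diag(confusion)) / sum(confusion),
    accLocale1 = confusion[1, 1] / sum(confusion[1, ]),
    accLocale2 = confusion[2, 2] / sum(confusion[2, ])
  )
}

# Made-up confusion matrix for two locales "A" and "B"
cm <- matrix(c(90, 10, 20, 80), nrow = 2, byrow = TRUE,
             dimnames = list(c("A", "B"), c("A", "B")))
helperAccuracyLocale(cm)
```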

Accuracy can then be assessed:

# Create a tibble from the underlying accuracy data
acc_TempF_DewF_month <- map_dfr(list_2016_TempF_DewF_month, .f=helperAccuracyLocale)

# Assess the top 10 classification accuracies
acc_TempF_DewF_month %>%
    arrange(-accOverall) %>%
    head(10)
## # A tibble: 10 x 5
##    locale1              locale2                 accOverall accLocale1 accLocale2
##    <chr>                <chr>                        <dbl>      <dbl>      <dbl>
##  1 Denver, CO (2016)    Miami, FL (2016)             0.997      0.997      0.997
##  2 Las Vegas, NV (2016) Miami, FL (2016)             0.990      0.991      0.989
##  3 Miami, FL (2016)     Seattle, WA (2016)           0.985      0.985      0.985
##  4 Denver, CO (2016)    Tampa Bay, FL (2016)         0.984      0.986      0.982
##  5 Miami, FL (2016)     Traverse City, MI (201~      0.980      0.981      0.980
##  6 Miami, FL (2016)     Minneapolis, MN (2016)       0.978      0.978      0.978
##  7 Boston, MA (2016)    Miami, FL (2016)             0.976      0.975      0.976
##  8 Denver, CO (2016)    New Orleans, LA (2016)       0.975      0.975      0.975
##  9 Miami, FL (2016)     Phoenix, AZ (2016)           0.974      0.973      0.975
## 10 Green Bay, WI (2016) Miami, FL (2016)             0.971      0.972      0.971
# Assess the bottom 10 classification accuracies
acc_TempF_DewF_month %>%
    arrange(accOverall) %>%
    head(10)
## # A tibble: 10 x 5
##    locale1                locale2               accOverall accLocale1 accLocale2
##    <chr>                  <chr>                      <dbl>      <dbl>      <dbl>
##  1 Newark, NJ (2016)      Philadelphia, PA (20~      0.580      0.591      0.570
##  2 Chicago, IL (2016)     Milwaukee, WI (2016)       0.591      0.600      0.583
##  3 Chicago, IL (2016)     Madison, WI (2016)         0.598      0.622      0.573
##  4 Detroit, MI (2016)     Grand Rapids, MI (20~      0.608      0.568      0.649
##  5 Chicago, IL (2016)     Detroit, MI (2016)         0.610      0.633      0.587
##  6 Grand Rapids, MI (201~ Traverse City, MI (2~      0.613      0.630      0.597
##  7 Madison, WI (2016)     Milwaukee, WI (2016)       0.618      0.625      0.611
##  8 Green Bay, WI (2016)   Madison, WI (2016)         0.619      0.586      0.652
##  9 Detroit, MI (2016)     Madison, WI (2016)         0.619      0.639      0.598
## 10 Chicago, IL (2016)     Grand Rapids, MI (20~      0.621      0.647      0.595
allAccuracy_month <- select(acc_TempF_DewF_month, 
                            locale=locale1, 
                            other=locale2, 
                            accOverall, 
                            accLocale=accLocale1
                            ) %>%
    bind_rows(select(acc_TempF_DewF_month, 
                     locale=locale2, 
                     other=locale1, 
                     accOverall, 
                     accLocale=accLocale2
                     )
              )

# Overall accuracy by location plot
allAccuracy_month %>%
    group_by(locale) %>%
    summarize_if(is.numeric, mean) %>%
    ggplot(aes(x=fct_reorder(locale, accOverall), y=accOverall)) + 
    geom_point(size=2) + 
    geom_text(aes(y=accOverall+0.02, label=paste0(round(100*accOverall), "%"))) +
    coord_flip() + 
    labs(x="", 
         y="Average Accuracy", 
         title="Average Accuracy Predicting Locale",
         subtitle="Predictions made 1:1 to each other locale (average accuracy reported)",
         caption="Temperature, Dew Point, Month as predictors\n(50% is baseline coinflip accuracy)"
         ) + 
    ylim(c(0.5, 1))

# Overall accuracy heatmap (FIX for better ordering of locales)
# allAccuracy_month %>% 
#     ggplot(aes(x=locale, y=other)) + 
#     geom_tile(aes(fill=accOverall)) + 
#     theme(axis.text.x=element_text(angle=90)) + 
#     scale_fill_continuous("Accuracy", high="darkblue", low="white") + 
#     labs(title="Accuracy Predicting Locale vs. Locale", 
#          caption="Temperature, Dew Point, and Month as predictors\n(50% is baseline coinflip accuracy)",
#          x="",
#          y=""
#          )

Accuracy is best for pairs of cities in significantly different climates. The model is especially successful at pulling apart the desert cities (Las Vegas, Phoenix) from everything else, and the humid Florida cities (Miami, Tampa Bay) from everything else.

Next, the simple model is run to classify locale across the full 2016 dataset:

# Run random forest for all 2016 locales
rf_all_2016_TDm <- rfMultiLocale(metarData, 
                                 vrbls=c("TempF", "DewF", "month"),
                                 locs=names_2016, 
                                 ntree=50, 
                                 seed=2006281508
                                 )

Summaries can then be created for the accuracy in predicting each locale:

evalPredictions(rf_all_2016_TDm, plotCaption = "Temp, Dew Point, Month")

## # A tibble: 898 x 5
##    locale             predicted               correct     n    pct
##    <fct>              <fct>                   <lgl>   <int>  <dbl>
##  1 Atlanta, GA (2016) Atlanta, GA (2016)      TRUE      484 0.182 
##  2 Atlanta, GA (2016) Boston, MA (2016)       FALSE      49 0.0185
##  3 Atlanta, GA (2016) Chicago, IL (2016)      FALSE      39 0.0147
##  4 Atlanta, GA (2016) Dallas, TX (2016)       FALSE     179 0.0675
##  5 Atlanta, GA (2016) Denver, CO (2016)       FALSE      34 0.0128
##  6 Atlanta, GA (2016) Detroit, MI (2016)      FALSE      42 0.0158
##  7 Atlanta, GA (2016) Grand Rapids, MI (2016) FALSE      48 0.0181
##  8 Atlanta, GA (2016) Green Bay, WI (2016)    FALSE      42 0.0158
##  9 Atlanta, GA (2016) Houston, TX (2016)      FALSE      96 0.0362
## 10 Atlanta, GA (2016) Indianapolis, IN (2016) FALSE      49 0.0185
## # ... with 888 more rows

Accuracy improves meaningfully over the null accuracy, though no locale is classified with better than 50% accuracy. This suggests that the goal will be to define some archetype climates (e.g., humid, cold, desert) and to use those for prediction.

Clouds (minimum levels and ceiling heights) can potentially help further differentiate the cold-weather cities on the downwind side of the Great Lakes, and help pull apart various marine and mid-continental climates.

An updated random forest model is then run:

# Run random forest for all 2016 locales
rf_all_2016_TDmc <- rfMultiLocale(metarData, 
                                  vrbls=c("TempF", "DewF", "month", "minHeight", "ceilingHeight"),
                                  locs=names_2016, 
                                  ntree=50, 
                                  seed=2006281509
                                  )

The evaluation process can again be run:

evalPredictions(rf_all_2016_TDmc, plotCaption = "Temp, Dew Point, Month, Cloud/Ceiling Height")

## # A tibble: 897 x 5
##    locale             predicted               correct     n     pct
##    <fct>              <fct>                   <lgl>   <int>   <dbl>
##  1 Atlanta, GA (2016) Atlanta, GA (2016)      TRUE      643 0.240  
##  2 Atlanta, GA (2016) Boston, MA (2016)       FALSE      84 0.0313 
##  3 Atlanta, GA (2016) Chicago, IL (2016)      FALSE      61 0.0227 
##  4 Atlanta, GA (2016) Dallas, TX (2016)       FALSE     145 0.0541 
##  5 Atlanta, GA (2016) Denver, CO (2016)       FALSE      44 0.0164 
##  6 Atlanta, GA (2016) Detroit, MI (2016)      FALSE      51 0.0190 
##  7 Atlanta, GA (2016) Grand Rapids, MI (2016) FALSE      49 0.0183 
##  8 Atlanta, GA (2016) Green Bay, WI (2016)    FALSE      26 0.00969
##  9 Atlanta, GA (2016) Houston, TX (2016)      FALSE      82 0.0306 
## 10 Atlanta, GA (2016) Indianapolis, IN (2016) FALSE     100 0.0373 
## # ... with 887 more rows
# Get the comparison to previous accuracy
deltaAccuracy(prevObject=rf_all_2016_TDm, currObject=rf_all_2016_TDmc)

Accuracy increases meaningfully, though it remains well under 50% in most cases.
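deltaAccuracy() is defined in the sourced functions file; a minimal sketch of the presumed comparison, assuming the per-locale accuracies have already been extracted from the two fitted objects (the named-vector inputs here are a stand-in for that extraction):

```r
# Hypothetical sketch of deltaAccuracy(): compare per-locale accuracies
# from a previous and a current model run. The inputs are assumed to be
# named numeric vectors of per-locale accuracy; the real function would
# extract these from the fitted rfMultiLocale objects.
deltaAccuracy <- function(prevAcc, currAcc) {
  shared <- intersect(names(prevAcc), names(currAcc))
  data.frame(
    locale  = shared,
    prevAcc = prevAcc[shared],
    currAcc = currAcc[shared],
    delta   = currAcc[shared] - prevAcc[shared],
    row.names = NULL
  )
}

# Made-up accuracies for two locales
prevAcc <- c("Atlanta, GA (2016)" = 0.24, "Boston, MA (2016)" = 0.31)
currAcc <- c("Atlanta, GA (2016)" = 0.42, "Boston, MA (2016)" = 0.37)
deltaAccuracy(prevAcc, currAcc)
```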

The model can also be built out to consider wind speed and wind direction. No attempt yet is made to control for over-fitting:

# Run random forest for all 2016 locales
rf_all_2016_TDmcw <- rfMultiLocale(metarData, 
                                   vrbls=c("TempF", "DewF", 
                                           "month", 
                                           "minHeight", "ceilingHeight", 
                                           "WindSpeed", "predomDir"
                                           ),
                                   locs=names_2016, 
                                   ntree=50, 
                                   seed=2006281934
                                   )

The evaluation process can again be run:

# Get the graphs of the prediction accuracy
evalPredictions(rf_all_2016_TDmcw, 
                plotCaption = "Temp, Dew Point, Month, Cloud/Ceiling Height, Wind Speed/Direction"
                )

## # A tibble: 889 x 5
##    locale             predicted               correct     n     pct
##    <fct>              <fct>                   <lgl>   <int>   <dbl>
##  1 Atlanta, GA (2016) Atlanta, GA (2016)      TRUE     1096 0.415  
##  2 Atlanta, GA (2016) Boston, MA (2016)       FALSE      34 0.0129 
##  3 Atlanta, GA (2016) Chicago, IL (2016)      FALSE      46 0.0174 
##  4 Atlanta, GA (2016) Dallas, TX (2016)       FALSE     135 0.0511 
##  5 Atlanta, GA (2016) Denver, CO (2016)       FALSE      28 0.0106 
##  6 Atlanta, GA (2016) Detroit, MI (2016)      FALSE      31 0.0117 
##  7 Atlanta, GA (2016) Grand Rapids, MI (2016) FALSE      26 0.00984
##  8 Atlanta, GA (2016) Green Bay, WI (2016)    FALSE      24 0.00908
##  9 Atlanta, GA (2016) Houston, TX (2016)      FALSE      58 0.0219 
## 10 Atlanta, GA (2016) Indianapolis, IN (2016) FALSE      59 0.0223 
## # ... with 879 more rows
# Get the comparison to previous accuracy
deltaAccuracy(prevObject=rf_all_2016_TDmc, currObject=rf_all_2016_TDmcw)

Including wind significantly improves model accuracy for many locales. Even the cold-weather cities are now being predicted with around 30%-40% accuracy.

The model can also be built out to consider sea-level pressure. No attempt yet is made to control for over-fitting, with the exception that mtry is held at 4:

# Run random forest for all 2016 locales
rf_all_2016_TDmcwa <- rfMultiLocale(metarData, 
                                    vrbls=c("TempF", "DewF", 
                                            "month",
                                            "minHeight", "ceilingHeight", 
                                            "WindSpeed", "predomDir",
                                            "modSLP"
                                            ),
                                    locs=names_2016, 
                                    ntree=50, 
                                    mtry=4,
                                    seed=2006282103
                                    )

The evaluation process can again be run:

# Get the graphs of the prediction accuracy
evalPredictions(rf_all_2016_TDmcwa, 
                plotCaption = "Temp, Dew Point, Month, Cloud/Ceiling Height, Wind Speed/Direction, SLP"
                )

## # A tibble: 880 x 5
##    locale             predicted               correct     n     pct
##    <fct>              <fct>                   <lgl>   <int>   <dbl>
##  1 Atlanta, GA (2016) Atlanta, GA (2016)      TRUE     1639 0.625  
##  2 Atlanta, GA (2016) Boston, MA (2016)       FALSE      16 0.00610
##  3 Atlanta, GA (2016) Chicago, IL (2016)      FALSE      23 0.00877
##  4 Atlanta, GA (2016) Dallas, TX (2016)       FALSE      75 0.0286 
##  5 Atlanta, GA (2016) Denver, CO (2016)       FALSE       8 0.00305
##  6 Atlanta, GA (2016) Detroit, MI (2016)      FALSE      18 0.00686
##  7 Atlanta, GA (2016) Grand Rapids, MI (2016) FALSE      12 0.00457
##  8 Atlanta, GA (2016) Green Bay, WI (2016)    FALSE       6 0.00229
##  9 Atlanta, GA (2016) Houston, TX (2016)      FALSE      45 0.0172 
## 10 Atlanta, GA (2016) Indianapolis, IN (2016) FALSE      51 0.0194 
## # ... with 870 more rows
# Get the comparison to previous accuracy
deltaAccuracy(prevObject=rf_all_2016_TDmcw, currObject=rf_all_2016_TDmcwa)

Adding sea-level pressure again significantly improves prediction ability. The cold weather cities are in the roughly 40%-60% accuracy range, and the other cities are in the roughly 60%-80% accuracy range.

Based on the hierarchical clustering and results of the initial random forest models, a few archetype cities are defined in an attempt to pull out distinctions:

  • Miami (likely to match with Tampa, New Orleans, and Houston)
  • Las Vegas (likely to match with Phoenix)
  • Los Angeles (likely to match with San Diego and San Jose)
  • Denver (seems potentially isolated)
  • Seattle (seems potentially isolated)
  • Dallas (likely to match to San Antonio)
  • Minneapolis (surrogate for the cold side of cold weather cities)
  • Atlanta (possible to match with DC, St Louis, Indianapolis, but may be too similar to Cold/Miami)

A data subset is created, with “hr” added for the Zulu hour of the observation (as integer rather than as factor):

archeCities <- c("kmia_2016", "klas_2016", "klax_2016", "kden_2016", 
                 "ksea_2016", "kdfw_2016", "kmsp_2016", "katl_2016"
                 )

archeData <- metarData %>%
    filter(source %in% archeCities) %>%
    mutate(hr=lubridate::hour(dtime))
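For reference, the same hour extraction can be done in base R without lubridate:

```r
# Base-R equivalent of lubridate::hour() for a UTC POSIXct date-time
dtime <- as.POSIXct("2016-01-01 13:53:00", tz = "UTC")
hr <- as.integer(format(dtime, "%H", tz = "UTC"))
hr
```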

A random forest, with mtry=4, is then run using all variables from previous, as well as hour:

# Run random forest for all 2016 locales
rf_arche_2016_TDmcwa <- rfMultiLocale(archeData, 
                                      vrbls=c("TempF", "DewF", 
                                              "month", "hr",
                                              "minHeight", "ceilingHeight", 
                                              "WindSpeed", "predomDir",
                                              "modSLP"
                                              ),
                                      locs=NULL, # defaults to running for all locales
                                      ntree=100, 
                                      mtry=4,
                                      seed=2006291353
                                      )
## 
## Running for locations:
## [1] "katl_2016" "kden_2016" "kdfw_2016" "klas_2016" "klax_2016" "kmia_2016"
## [7] "kmsp_2016" "ksea_2016"

The evaluation process can again be run:

# Get the graphs of the prediction accuracy
evalPredictions(rf_arche_2016_TDmcwa, 
                plotCaption = "Temp, Dew Point, Month/Hour, Cloud/Ceiling Height, Wind Speed/Direction, SLP"
                )

## # A tibble: 62 x 5
##    locale             predicted              correct     n    pct
##    <fct>              <fct>                  <lgl>   <int>  <dbl>
##  1 Atlanta, GA (2016) Atlanta, GA (2016)     TRUE     2204 0.835 
##  2 Atlanta, GA (2016) Dallas, TX (2016)      FALSE     139 0.0527
##  3 Atlanta, GA (2016) Denver, CO (2016)      FALSE      39 0.0148
##  4 Atlanta, GA (2016) Las Vegas, NV (2016)   FALSE      33 0.0125
##  5 Atlanta, GA (2016) Los Angeles, CA (2016) FALSE      86 0.0326
##  6 Atlanta, GA (2016) Miami, FL (2016)       FALSE      50 0.0189
##  7 Atlanta, GA (2016) Minneapolis, MN (2016) FALSE      45 0.0170
##  8 Atlanta, GA (2016) Seattle, WA (2016)     FALSE      44 0.0167
##  9 Dallas, TX (2016)  Atlanta, GA (2016)     FALSE     203 0.0789
## 10 Dallas, TX (2016)  Dallas, TX (2016)      TRUE     2105 0.818 
## # ... with 52 more rows

The model is reasonably successful in pulling apart the archetypes. Denver and Dallas seem potentially less distinct, though all archetype cities are predicted with 80%+ accuracy.

The remaining cities can be classified against the archetype data, to explore any patterns that emerge:

localeByArchetype(rf_arche_2016_TDmcwa, 
                  fullData=metarData, 
                  archeCities=archeCities, 
                  sortDescMatch=TRUE
                  )

There appear to be several issues:

  • Atlanta and Dallas are not particularly distinct and do not make for good archetypes (potentially, Miami, Atlanta, and Dallas can all be replaced by Houston); Denver and Seattle are shaky but may be OK
  • Minneapolis may be too cold, driving the midwestern cities to instead classify partially as Denver
  • The model is potentially learning too much about the specific city rather than about archetypes

The first issue is addressed by collapsing Atlanta, Dallas, and Miami and instead using Houston as the archetype; and replacing Minneapolis with Chicago:

archeCities_v02 <- c("kiah_2016", "klas_2016", "klax_2016", 
                     "kden_2016", "ksea_2016", "kord_2016"
                     )

archeData_v02 <- metarData %>%
    filter(source %in% archeCities_v02) %>%
    mutate(hr=lubridate::hour(dtime))

A random forest, with mtry=4, is then run using all variables from previous, as well as hour:

# Run random forest for all 2016 locales in archeData_v02
rf_arche_2016_TDmcwa_v02 <- rfMultiLocale(archeData_v02, 
                                          vrbls=c("TempF", "DewF", 
                                                  "month", "hr",
                                                  "minHeight", "ceilingHeight", 
                                                  "WindSpeed", "predomDir",
                                                  "modSLP"
                                                  ),
                                          locs=NULL, # defaults to running for all locales
                                          ntree=100, 
                                          mtry=4,
                                          seed=2006291448
                                          )
## 
## Running for locations:
## [1] "kden_2016" "kiah_2016" "klas_2016" "klax_2016" "kord_2016" "ksea_2016"

The evaluation process can again be run:

# Get the graphs of the prediction accuracy
evalPredictions(rf_arche_2016_TDmcwa_v02, 
                plotCaption = "Temp, Dew Point, Month/Hour, Cloud/Ceiling Height, Wind Speed/Direction, SLP"
                )

## # A tibble: 36 x 5
##    locale             predicted              correct     n     pct
##    <fct>              <fct>                  <lgl>   <int>   <dbl>
##  1 Chicago, IL (2016) Chicago, IL (2016)     TRUE     2321 0.872  
##  2 Chicago, IL (2016) Denver, CO (2016)      FALSE      99 0.0372 
##  3 Chicago, IL (2016) Houston, TX (2016)     FALSE      54 0.0203 
##  4 Chicago, IL (2016) Las Vegas, NV (2016)   FALSE      13 0.00489
##  5 Chicago, IL (2016) Los Angeles, CA (2016) FALSE      70 0.0263 
##  6 Chicago, IL (2016) Seattle, WA (2016)     FALSE     104 0.0391 
##  7 Denver, CO (2016)  Chicago, IL (2016)     FALSE     119 0.0453 
##  8 Denver, CO (2016)  Denver, CO (2016)      TRUE     2282 0.869  
##  9 Denver, CO (2016)  Houston, TX (2016)     FALSE       3 0.00114
## 10 Denver, CO (2016)  Las Vegas, NV (2016)   FALSE     126 0.0480 
## # ... with 26 more rows

The model is reasonably successful in pulling apart the archetypes. Houston appears the most distinct (as had Miami previously), and there is less overlap going to/from Denver or Seattle.

The remaining cities can be classified against the archetype data, to explore any patterns that emerge:

localeByArchetype(rf_arche_2016_TDmcwa_v02, 
                  fullData=metarData, 
                  archeCities=archeCities_v02, 
                  sortDescMatch = TRUE
                  )

At a glance, the archetypes are more sensible. There is a bit of Denver and a bit of Seattle in most of the cold weather cities, and a bit of Houston in some of the warmer cold weather cities. This suggests either that there is more than one type of cold-weather city, or that cold-weather cities resemble other archetypes at certain times.

The model is run again, deleting Denver and Seattle, and converting Los Angeles to San Diego:

archeCities_v03 <- c("kiah_2016", "klas_2016", "ksan_2016", "kord_2016")

archeData_v03 <- metarData %>%
    filter(source %in% archeCities_v03) %>%
    mutate(hr=lubridate::hour(dtime))

A random forest, with mtry=4, is then run using all variables from previous, as well as hour:

# Run random forest for all 2016 locales in archeData_v03
rf_arche_2016_TDmcwa_v03 <- rfMultiLocale(archeData_v03, 
                                          vrbls=c("TempF", "DewF", 
                                                  "month", "hr",
                                                  "minHeight", "ceilingHeight", 
                                                  "WindSpeed", "predomDir",
                                                  "modSLP"
                                                  ),
                                          locs=NULL, # defaults to running for all locales
                                          ntree=100, 
                                          mtry=4,
                                          seed=2006291459
                                          )
## 
## Running for locations:
## [1] "kiah_2016" "klas_2016" "kord_2016" "ksan_2016"

The evaluation process can again be run:

# Get the graphs of the prediction accuracy
evalPredictions(rf_arche_2016_TDmcwa_v03, 
                plotCaption = "Temp, Dew Point, Month/Hour, Cloud/Ceiling Height, Wind Speed/Direction, SLP"
                )
## Warning: Removed 1 rows containing missing values (geom_text).

## # A tibble: 16 x 5
##    locale               predicted            correct     n     pct
##    <fct>                <fct>                <lgl>   <int>   <dbl>
##  1 Chicago, IL (2016)   Chicago, IL (2016)   TRUE     2511 0.946  
##  2 Chicago, IL (2016)   Houston, TX (2016)   FALSE      69 0.0260 
##  3 Chicago, IL (2016)   Las Vegas, NV (2016) FALSE      26 0.00979
##  4 Chicago, IL (2016)   San Diego, CA (2016) FALSE      49 0.0185 
##  5 Houston, TX (2016)   Chicago, IL (2016)   FALSE      44 0.0170 
##  6 Houston, TX (2016)   Houston, TX (2016)   TRUE     2473 0.954  
##  7 Houston, TX (2016)   Las Vegas, NV (2016) FALSE      19 0.00733
##  8 Houston, TX (2016)   San Diego, CA (2016) FALSE      57 0.0220 
##  9 Las Vegas, NV (2016) Chicago, IL (2016)   FALSE      29 0.0113 
## 10 Las Vegas, NV (2016) Houston, TX (2016)   FALSE      13 0.00508
## 11 Las Vegas, NV (2016) Las Vegas, NV (2016) TRUE     2482 0.971  
## 12 Las Vegas, NV (2016) San Diego, CA (2016) FALSE      33 0.0129 
## 13 San Diego, CA (2016) Chicago, IL (2016)   FALSE      35 0.0130 
## 14 San Diego, CA (2016) Houston, TX (2016)   FALSE      29 0.0108 
## 15 San Diego, CA (2016) Las Vegas, NV (2016) FALSE      63 0.0235 
## 16 San Diego, CA (2016) San Diego, CA (2016) TRUE     2559 0.953

These archetypes are distinct, with accuracies all in the 95% range. The remaining cities can be classified against the archetype data, to explore any patterns that emerge:

localeByArchetype(rf_arche_2016_TDmcwa_v03,
                  fullData=metarData,
                  archeCities=archeCities_v03,
                  sortDescMatch=TRUE
                  )

Quite a few cities show as blends of the archetype cities, which is likely correct if the archetypes are considered to be Cold, Humid, Marine, and Desert. An open question is whether to expand that list with another 1-2 archetypes that attempt to better capture Atlanta, Dallas, Denver, St Louis, and DC.

The Marine archetype may need to be converted to Pacific to better identify the California cities.

A decision about Seattle is needed, as it is a little like everything and everything is a little like it.

To address the issue of the model learning from specific cities rather than archetypes, user-defined clusters will be created, and data from these clusters will be used in modeling:

  • Marine - San Diego, Los Angeles in an attempt to train the model to recognize that Marine cities differ meaningfully from Cold cities and Humid cities (San Jose, San Francisco, and Seattle have somewhat different climates)
  • Desert - Las Vegas, Phoenix in an attempt to train the model on differences in Desert and Marine
  • Humid - Houston, Miami, New Orleans, Tampa, reflecting that the model is already largely pulling out the humid cities correctly
  • Cold - Chicago, Milwaukee, Grand Rapids, Green Bay, Traverse City, Madison, Detroit, Minneapolis to reflect that the model is largely already pulling out cold cities correctly (Boston excluded for now)
  • South - Atlanta, Dallas, San Antonio in an attempt to pull out southern cities that are not as humid as those along the Gulf Coast
  • Mixed - Indianapolis, Lincoln, St Louis, to reflect cities that have milder winters than Cold but colder winters than South
  • Excluded for now - Denver, Seattle, Boston/Newark/Philadelphia/DC, San Francisco/San Jose

Further, a variable will be created for “hr”, the Zulu hour of the observation:

# Create the locale mapper
locMapperTibble <- tibble::tribble(
    ~source, ~locType,
    'katl_2016', 'South',
    'kbos_2016', 'Exclude',
    'kdca_2016', 'Exclude',
    'kden_2016', 'Exclude',
    'kdfw_2016', 'South',
    'kdtw_2016', 'Cold',
    'kewr_2016', 'Exclude',
    'kgrb_2016', 'Cold',
    'kgrr_2016', 'Cold',
    'kiah_2016', 'Humid',
    'kind_2016', 'Mixed',
    'klas_2016', 'Desert',
    'klax_2016', 'Marine',
    'klnk_2016', 'Mixed',
    'kmia_2016', 'Humid',
    'kmke_2016', 'Cold',
    'kmsn_2016', 'Cold',
    'kmsp_2016', 'Cold',
    'kmsy_2016', 'Humid',
    'kord_2016', 'Cold',
    'kphl_2016', 'Exclude',
    'kphx_2016', 'Desert',
    'ksan_2016', 'Marine',
    'ksat_2016', 'South',
    'ksea_2016', 'Exclude',
    'ksfo_2016', 'Exclude',
    'ksjc_2016', 'Exclude',
    'kstl_2016', 'Mixed',
    'ktpa_2016', 'Humid',
    'ktvc_2016', 'Cold'
)

# Create locMapper
locMapper <- locMapperTibble %>% pull(locType)
names(locMapper) <- locMapperTibble %>% pull(source)
locMapper
## katl_2016 kbos_2016 kdca_2016 kden_2016 kdfw_2016 kdtw_2016 kewr_2016 kgrb_2016 
##   "South" "Exclude" "Exclude" "Exclude"   "South"    "Cold" "Exclude"    "Cold" 
## kgrr_2016 kiah_2016 kind_2016 klas_2016 klax_2016 klnk_2016 kmia_2016 kmke_2016 
##    "Cold"   "Humid"   "Mixed"  "Desert"  "Marine"   "Mixed"   "Humid"    "Cold" 
## kmsn_2016 kmsp_2016 kmsy_2016 kord_2016 kphl_2016 kphx_2016 ksan_2016 ksat_2016 
##    "Cold"    "Cold"   "Humid"    "Cold" "Exclude"  "Desert"  "Marine"   "South" 
## ksea_2016 ksfo_2016 ksjc_2016 kstl_2016 ktpa_2016 ktvc_2016 
## "Exclude" "Exclude" "Exclude"   "Mixed"   "Humid"    "Cold"
# Create the data file with locType
locData_v04 <- metarData %>% 
    mutate(locType=locMapper[source], hr=lubridate::hour(dtime)) %>% 
    filter(year==2016, locType !="Exclude")

# Counts by locType
locData_v04 %>%
    count(locType)
## # A tibble: 6 x 2
##   locType     n
##   <chr>   <int>
## 1 Cold    70101
## 2 Desert  17543
## 3 Humid   35074
## 4 Marine  17536
## 5 Mixed   26187
## 6 South   26309

There is a risk that the model will predict neutral cities as ‘Cold’ given the 4:1 ratio of Cold cities to Marine/Desert cities. The random forest model is first run on the data ‘as is’:

rf_arche_small_TDmcwa_v04 <- rfMultiLocale(locData_v04, 
                                           vrbls=c("TempF", "DewF", 
                                                   "month", "hr",
                                                   "minHeight", "ceilingHeight", 
                                                   "WindSpeed", "predomDir",
                                                   "modSLP"
                                                   ),
                                           locs=NULL, # defaults to running for all locales
                                           ntree=100, 
                                           locVar="locType", 
                                           pred="locType",
                                           mtry=4, 
                                           otherVar=c("dtime", "source", "locale"),
                                           seed=2006301313
                                           )
## 
## Running for locations:
## [1] "Cold"   "Desert" "Humid"  "Marine" "Mixed"  "South"

Prediction accuracy is then assessed:

# Get the graphs of the prediction accuracy
evalPredictions(rf_arche_small_TDmcwa_v04,
                plotCaption = "Temp, Dew Point, Month/Hour, Cloud/Ceiling Height, Wind Speed/Direction, SLP",
                keyVar="locType"
                )

## # A tibble: 36 x 5
##    locale predicted correct     n     pct
##    <fct>  <fct>     <lgl>   <int>   <dbl>
##  1 Cold   Cold      TRUE    19819 0.947  
##  2 Cold   Desert    FALSE      37 0.00177
##  3 Cold   Humid     FALSE     119 0.00568
##  4 Cold   Marine    FALSE     159 0.00759
##  5 Cold   Mixed     FALSE     549 0.0262 
##  6 Cold   South     FALSE     254 0.0121 
##  7 Desert Cold      FALSE     122 0.0234 
##  8 Desert Desert    TRUE     4790 0.921  
##  9 Desert Humid     FALSE      41 0.00788
## 10 Desert Marine    FALSE     115 0.0221 
## # ... with 26 more rows

The designation of Cold, Desert, Humid, and Marine seems successful. The attempt to split South from Humid resulted in good, if incomplete, separation. The attempt to separate a group of ‘warmer’ cold-weather cities from ‘colder’ cold-weather cities was not particularly successful. This could partially be an artifact of ‘Cold’ having more than double the data volume of ‘Mixed’.
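
One standard mitigation for this kind of class imbalance is to downsample each class to the size of the smallest before modeling. A minimal base-R sketch on toy data (the data frame and counts below are illustrative, not the module's locData):

```r
# Toy stand-in for locData: a 'Cold' class with far more rows than 'Mixed'
toy <- data.frame(
    locType = rep(c("Cold", "Mixed"), times = c(700, 260)),
    TempF   = rnorm(960)
)

set.seed(1)

# Size of the smallest class
nSmall <- min(table(toy$locType))

# Downsample every class to nSmall rows each
balanced <- do.call(rbind, lapply(split(toy, toy$locType), function(d) {
    d[sample(nrow(d), nSmall), ]
}))

table(balanced$locType)
```

The same effect can be obtained within the tidyverse via grouped sampling, which is the approach taken further below.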

Classifications by city can also be examined:

archeCities_v04 <- locMapperTibble %>%
    filter(locType != "Exclude") %>%
    pull(source)

localeByArchetype(rf_arche_small_TDmcwa_v04, 
                  fullData=mutate(metarData, locType=locMapper[source]), 
                  archeCities=archeCities_v04, 
                  sortDescMatch=TRUE
                  )

Many of the cities are nicely classified into their assigned archetypes, as desired. Among the cities used in the classifications, concerns include:

  • The cities pre-assigned as South do not seem to cluster so strongly together. This may not be a good archetype
  • The cities pre-assigned as Mixed do not cluster well, possibly driven by a data volume discrepancy where Cold has more than double the data
  • Remaining cities frequently assign to 2+ archetypes, as expected since weather in the mid-latitudes would not necessarily follow any archetypal pattern on a year-long basis

Classifications are changed somewhat, and data are then filtered so that an equal number of observations from each locale type is used in the modeling:

  • Mixed - deleted, entries will be moved to ‘Exclude’ to see how they map
  • South - deleted, entries will be moved to ‘Exclude’ to see how they map

# Create the locale mapper
locMapperTibble_v05 <- tibble::tribble(
    ~source, ~locType,
    'katl_2016', 'Exclude',
    'kbos_2016', 'Exclude',
    'kdca_2016', 'Exclude',
    'kden_2016', 'Exclude',
    'kdfw_2016', 'Exclude',
    'kdtw_2016', 'Cold',
    'kewr_2016', 'Exclude',
    'kgrb_2016', 'Cold',
    'kgrr_2016', 'Cold',
    'kiah_2016', 'Humid',
    'kind_2016', 'Exclude',
    'klas_2016', 'Desert',
    'klax_2016', 'Marine',
    'klnk_2016', 'Exclude',
    'kmia_2016', 'Humid',
    'kmke_2016', 'Cold',
    'kmsn_2016', 'Cold',
    'kmsp_2016', 'Cold',
    'kmsy_2016', 'Humid',
    'kord_2016', 'Cold',
    'kphl_2016', 'Exclude',
    'kphx_2016', 'Desert',
    'ksan_2016', 'Marine',
    'ksat_2016', 'Exclude',
    'ksea_2016', 'Exclude',
    'ksfo_2016', 'Exclude',
    'ksjc_2016', 'Exclude',
    'kstl_2016', 'Exclude',
    'ktpa_2016', 'Humid',
    'ktvc_2016', 'Cold'
)

# Create locMapper
locMapper_v05 <- locMapperTibble_v05 %>% pull(locType)
names(locMapper_v05) <- locMapperTibble_v05 %>% pull(source)
locMapper_v05
## katl_2016 kbos_2016 kdca_2016 kden_2016 kdfw_2016 kdtw_2016 kewr_2016 kgrb_2016 
## "Exclude" "Exclude" "Exclude" "Exclude" "Exclude"    "Cold" "Exclude"    "Cold" 
## kgrr_2016 kiah_2016 kind_2016 klas_2016 klax_2016 klnk_2016 kmia_2016 kmke_2016 
##    "Cold"   "Humid" "Exclude"  "Desert"  "Marine" "Exclude"   "Humid"    "Cold" 
## kmsn_2016 kmsp_2016 kmsy_2016 kord_2016 kphl_2016 kphx_2016 ksan_2016 ksat_2016 
##    "Cold"    "Cold"   "Humid"    "Cold" "Exclude"  "Desert"  "Marine" "Exclude" 
## ksea_2016 ksfo_2016 ksjc_2016 kstl_2016 ktpa_2016 ktvc_2016 
## "Exclude" "Exclude" "Exclude" "Exclude"   "Humid"    "Cold"
# Create the data file with locType
locData_v05 <- metarData %>% 
    mutate(locType=locMapper_v05[source], hr=lubridate::hour(dtime)) %>% 
    filter(year==2016, locType !="Exclude")

# Counts by locType
locData_v05 %>%
    count(locType)
## # A tibble: 4 x 2
##   locType     n
##   <chr>   <int>
## 1 Cold    70101
## 2 Desert  17543
## 3 Humid   35074
## 4 Marine  17536

The data are then filtered so that there is an equal number of observations from each locale type. Predictions can later be made against the remaining items:

# Set a seed for reproducibility
set.seed(2006301356)

# Find the smallest locale type
nSmall <- locData_v05 %>%
    filter(!is.na(locType)) %>%
    count(locType) %>%
    pull(n) %>%
    min()

# Create the relevant data subset
subData_v05 <- locData_v05 %>%
    filter(!is.na(locType)) %>%
    group_by(locType) %>%
    sample_n(size=nSmall, replace=FALSE) %>%
    ungroup()

# Summarize the data subset
subData_v05 %>% 
    count(locType, locale) %>% 
    arrange(-n)
## # A tibble: 16 x 3
##    locType locale                       n
##    <chr>   <chr>                    <int>
##  1 Marine  Los Angeles, CA (2016)    8774
##  2 Desert  Phoenix, AZ (2016)        8770
##  3 Desert  Las Vegas, NV (2016)      8766
##  4 Marine  San Diego, CA (2016)      8762
##  5 Humid   New Orleans, LA (2016)    4402
##  6 Humid   Miami, FL (2016)          4388
##  7 Humid   Houston, TX (2016)        4386
##  8 Humid   Tampa Bay, FL (2016)      4360
##  9 Cold    Grand Rapids, MI (2016)   2234
## 10 Cold    Milwaukee, WI (2016)      2231
## 11 Cold    Minneapolis, MN (2016)    2205
## 12 Cold    Madison, WI (2016)        2188
## 13 Cold    Traverse City, MI (2016)  2182
## 14 Cold    Detroit, MI (2016)        2173
## 15 Cold    Chicago, IL (2016)        2168
## 16 Cold    Green Bay, WI (2016)      2155

The previous model can then be run on the data subset:

rf_arche_small_TDmcwa_v05 <- rfMultiLocale(subData_v05, 
                                           vrbls=c("TempF", "DewF", 
                                                   "month", "hr",
                                                   "minHeight", "ceilingHeight", 
                                                   "WindSpeed", "predomDir",
                                                   "modSLP"
                                                   ),
                                           locs=NULL, # defaults to running for all locales
                                           ntree=100, 
                                           locVar="locType", 
                                           pred="locType",
                                           mtry=4, 
                                           otherVar=c("dtime", "source", "locale"),
                                           seed=2006301358
                                           )
## 
## Running for locations:
## [1] "Cold"   "Desert" "Humid"  "Marine"

The evaluation process can again be run:

# Get the graphs of the prediction accuracy
evalPredictions(rf_arche_small_TDmcwa_v05,
                plotCaption = "Temp, Dew Point, Month/Hour, Cloud/Ceiling Height, Wind Speed/Direction, SLP",
                keyVar="locType"
                )

## # A tibble: 16 x 5
##    locale predicted correct     n     pct
##    <fct>  <fct>     <lgl>   <int>   <dbl>
##  1 Cold   Cold      TRUE     4864 0.931  
##  2 Cold   Desert    FALSE      68 0.0130 
##  3 Cold   Humid     FALSE     116 0.0222 
##  4 Cold   Marine    FALSE     175 0.0335 
##  5 Desert Cold      FALSE      58 0.0112 
##  6 Desert Desert    TRUE     4925 0.955  
##  7 Desert Humid     FALSE      34 0.00659
##  8 Desert Marine    FALSE     141 0.0273 
##  9 Humid  Cold      FALSE      99 0.0188 
## 10 Humid  Desert    FALSE      64 0.0121 
## 11 Humid  Humid     TRUE     4916 0.932  
## 12 Humid  Marine    FALSE     194 0.0368 
## 13 Marine Cold      FALSE      88 0.0166 
## 14 Marine Desert    FALSE     210 0.0396 
## 15 Marine Humid     FALSE      81 0.0153 
## 16 Marine Marine    TRUE     4920 0.928

The archetypes are well-separated, with roughly 93%-95% accuracy in every case. Predictions can then also be made for all cities, including those not in the modeling:

archeCities_v05 <- locMapperTibble_v05 %>%
    filter(locType != "Exclude") %>%
    pull(source)

localeByArchetype(rf_arche_small_TDmcwa_v05, 
                  fullData=mutate(metarData, locType=locMapper_v05[source]), 
                  archeCities=archeCities_v05, 
                  sortDescMatch=TRUE
                  )

Broadly speaking, cities used in modeling appear to classify into the appropriate archetypes. For the cities not included in the modeling:

  • Boston, Lincoln, Indianapolis, Newark, Philadelphia, and Seattle are most closely matched to ‘Cold’
  • San Antonio is most closely matched to ‘Humid’
  • Denver is a split of Cold/Desert
  • DC and St Louis are a split of Cold/Humid
  • San Francisco and San Jose are most closely matched to ‘Marine’ with meaningful ‘Cold’ and ‘Humid’ also
  • Atlanta and Dallas are roughly split between ‘Humid’ and ‘Cold’, with a bit of ‘Desert’ also

The model can be applied to data from years other than 2016, to see how well the classifications hold up in out-of-sample years:

# Predictions on non-2016 data
helperPredictPlot(rf_arche_small_TDmcwa_v05$rfModel, 
                  df=filter(mutate(metarData, hr=lubridate::hour(dtime)), year!=2016), 
                  predOrder=c("Marine", "Humid", "Desert", "Cold"), 
                  locMapper=locMapper_v05
                  )

## # A tibble: 80 x 4
##    locale             predicted     n    pct
##    <chr>              <fct>     <int>  <dbl>
##  1 Chicago, IL (2014) Cold       7928 0.915 
##  2 Chicago, IL (2014) Desert      273 0.0315
##  3 Chicago, IL (2014) Humid       144 0.0166
##  4 Chicago, IL (2014) Marine      318 0.0367
##  5 Chicago, IL (2015) Cold       7722 0.886 
##  6 Chicago, IL (2015) Desert      247 0.0283
##  7 Chicago, IL (2015) Humid       340 0.0390
##  8 Chicago, IL (2015) Marine      411 0.0471
##  9 Chicago, IL (2017) Cold       7660 0.878 
## 10 Chicago, IL (2017) Desert      466 0.0534
## # ... with 70 more rows

Model performance on non-2016 data is not as strong, with roughly a 10%-20% loss of accuracy. Predictions are still much better than null accuracy, and the model (mostly) continues to separate the archetypes.

Suppose that models are run on all 2015-2018 data for Chicago, Las Vegas, New Orleans, and San Diego:

# Create the subset for Chicago, Las Vegas, New Orleans, San Diego
sub_2015_2018_data <- metarData %>%
    filter(str_sub(source, 1, 4) %in% c("kord", "klas", "kmsy", "ksan"), 
           year %in% c(2015, 2016, 2017, 2018)
           ) %>%
    mutate(city=str_replace(locale, pattern=" .\\d{4}.", replacement=""), 
           hr=lubridate::hour(dtime)
           )

# Check that proper locales are included
sub_2015_2018_data %>% 
    count(city, locale)
## # A tibble: 16 x 3
##    city            locale                     n
##    <chr>           <chr>                  <int>
##  1 Chicago, IL     Chicago, IL (2015)      8728
##  2 Chicago, IL     Chicago, IL (2016)      8767
##  3 Chicago, IL     Chicago, IL (2017)      8740
##  4 Chicago, IL     Chicago, IL (2018)      8737
##  5 Las Vegas, NV   Las Vegas, NV (2015)    8727
##  6 Las Vegas, NV   Las Vegas, NV (2016)    8770
##  7 Las Vegas, NV   Las Vegas, NV (2017)    8664
##  8 Las Vegas, NV   Las Vegas, NV (2018)    8746
##  9 New Orleans, LA New Orleans, LA (2015)  8727
## 10 New Orleans, LA New Orleans, LA (2016)  8767
## 11 New Orleans, LA New Orleans, LA (2017)  8723
## 12 New Orleans, LA New Orleans, LA (2018)  8737
## 13 San Diego, CA   San Diego, CA (2015)    8737
## 14 San Diego, CA   San Diego, CA (2016)    8762
## 15 San Diego, CA   San Diego, CA (2017)    8744
## 16 San Diego, CA   San Diego, CA (2018)    8744

The random forest model is run and cached:

# Run random forest for 2015-2018 data
rf_types_2015_2018_TDmcwha <- rfMultiLocale(sub_2015_2018_data, 
                                            vrbls=c("TempF", "DewF", 
                                                    "month", "hr",
                                                    "minHeight", "ceilingHeight", 
                                                    "WindSpeed", "predomDir", 
                                                    "modSLP"
                                                    ),
                                            locs=NULL, 
                                            locVar="city",
                                            pred="city",
                                            ntree=50, 
                                            seed=2006301420, 
                                            mtry=4
                                            )
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"
evalPredictions(rf_types_2015_2018_TDmcwha, 
                plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, SLP", 
                keyVar="city"
                )
## Warning: Removed 1 rows containing missing values (geom_text).

## # A tibble: 16 x 5
##    locale          predicted       correct     n     pct
##    <fct>           <fct>           <lgl>   <int>   <dbl>
##  1 Chicago, IL     Chicago, IL     TRUE     9847 0.941  
##  2 Chicago, IL     Las Vegas, NV   FALSE     121 0.0116 
##  3 Chicago, IL     New Orleans, LA FALSE     236 0.0226 
##  4 Chicago, IL     San Diego, CA   FALSE     258 0.0247 
##  5 Las Vegas, NV   Chicago, IL     FALSE     153 0.0146 
##  6 Las Vegas, NV   Las Vegas, NV   TRUE    10048 0.961  
##  7 Las Vegas, NV   New Orleans, LA FALSE      60 0.00574
##  8 Las Vegas, NV   San Diego, CA   FALSE     192 0.0184 
##  9 New Orleans, LA Chicago, IL     FALSE     224 0.0213 
## 10 New Orleans, LA Las Vegas, NV   FALSE      98 0.00932
## 11 New Orleans, LA New Orleans, LA TRUE     9859 0.938  
## 12 New Orleans, LA San Diego, CA   FALSE     329 0.0313 
## 13 San Diego, CA   Chicago, IL     FALSE     173 0.0166 
## 14 San Diego, CA   Las Vegas, NV   FALSE     281 0.0269 
## 15 San Diego, CA   New Orleans, LA FALSE     197 0.0189 
## 16 San Diego, CA   San Diego, CA   TRUE     9789 0.938

Even with a small forest (50 trees), the model almost always separates Las Vegas, Chicago, San Diego, and New Orleans. While the climates of these cities are very different, it is still striking that the model makes so few misclassifications.

How do other cities map against these classifications?

# Predictions on 2015/2018 data
helperPredictPlot(rf_types_2015_2018_TDmcwha$rfModel, 
                  df=filter(mutate(metarData, hr=lubridate::hour(dtime)), 
                            !(str_sub(source, 1, 4) %in% c("kord", "klas", "kmsy", "ksan"))
                            ), 
                  predOrder=c("Chicago, IL", "San Diego, CA", "New Orleans, LA", "Las Vegas, NV")
                  )

## # A tibble: 104 x 4
##    locale             predicted           n    pct
##    <chr>              <fct>           <int>  <dbl>
##  1 Atlanta, GA (2016) Chicago, IL      3363 0.384 
##  2 Atlanta, GA (2016) Las Vegas, NV     809 0.0924
##  3 Atlanta, GA (2016) New Orleans, LA  3709 0.424 
##  4 Atlanta, GA (2016) San Diego, CA     870 0.0994
##  5 Boston, MA (2016)  Chicago, IL      7710 0.891 
##  6 Boston, MA (2016)  Las Vegas, NV     430 0.0497
##  7 Boston, MA (2016)  New Orleans, LA   291 0.0336
##  8 Boston, MA (2016)  San Diego, CA     223 0.0258
##  9 Dallas, TX (2016)  Chicago, IL      2999 0.343 
## 10 Dallas, TX (2016)  Las Vegas, NV    1147 0.131 
## # ... with 94 more rows

Classifications are broadly as expected based on the previous archetype analysis. Variable importances are plotted:

helperPlotVarImp(rf_types_2015_2018_TDmcwha$rfModel)

Dew point and temperature by month continue to be strong factors for separating the four cities in this analysis. SLP, minimum cloud height, and prevailing wind direction are also meaningful.
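
helperPlotVarImp is assumed to plot the importance scores stored on the fitted forest (for example, mean decrease in Gini from randomForest::importance()); ranking such scores directly is straightforward in base R. The numeric values below are illustrative only, chosen to mirror the ordering described above:

```r
# Illustrative importance scores, standing in for
# randomForest::importance(rfModel)[, "MeanDecreaseGini"]
imp <- c(TempF = 4100, DewF = 4800, month = 950, hr = 300,
         minHeight = 1200, ceilingHeight = 800,
         WindSpeed = 600, predomDir = 1500, modSLP = 2100)

# Rank variables from most to least important
sort(imp, decreasing = TRUE)
```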

An assessment can be run for the 2015-2018 model:

# Run for the full model including SLP
probs_2015_2018_TDmcwha <- 
    assessPredictionCertainty(rf_types_2015_2018_TDmcwha, 
                              keyVar="city", 
                              plotCaption="Temp, Dew Point, Month/Hour, Clouds, Wind, SLP", 
                              showAcc=TRUE
                              )
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(keyVar)` instead of `keyVar` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(pkVars)` instead of `pkVars` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.

  • Predictions with 80%+ of the votes are made ~75% of the time, and these predictions are ~99% accurate
  • Predictions with <80% of the votes are made ~25% of the time, and these predictions are ~80% accurate
  • The percentage of votes received appears to be a reasonable proxy for the confidence of the prediction
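
The vote-share heuristic can be sketched in base R, assuming a per-observation vote matrix like the one randomForest returns from predict(..., type = "vote"); the values below are made up for illustration:

```r
# Made-up vote matrix: rows are observations, columns are classes,
# entries are the fraction of trees voting for each class
votes <- matrix(c(0.90, 0.05, 0.03, 0.02,
                  0.45, 0.40, 0.10, 0.05,
                  0.85, 0.10, 0.03, 0.02),
                nrow = 3, byrow = TRUE,
                dimnames = list(NULL, c("Cold", "Desert", "Humid", "Marine")))

# Predicted class is the column with the most votes;
# its vote share serves as the confidence proxy
predClass <- colnames(votes)[max.col(votes)]
voteShare <- apply(votes, 1, max)

# Flag high-confidence predictions at the 80% threshold
data.frame(predClass, voteShare, highConf = voteShare >= 0.80)
```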

A similar process can be run to assess the classification of the other cities against the 2015-2018 data for Chicago, Las Vegas, New Orleans, and San Diego:

useData <- metarData %>%
    filter(!(str_sub(source, 1, 4) %in% c("kord", "klas", "kmsy", "ksan"))) %>%
    mutate(hr=lubridate::hour(dtime))
    
# Run for the model including modSLP
probs_allcities_2015_2018_TDmcwh <- 
    assessPredictionCertainty(rf_types_2015_2018_TDmcwha, 
                              testData=useData,
                              keyVar="locale", 
                              plotCaption="Temp, Dew Point, Month/Hour, Clouds, Wind, modSLP", 
                              showHists=TRUE
                              )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The model is often less confident when assigning an archetype to these related cities, though it usually makes the most sensible assignment.

An assessment is run to look at the evolution of the model as it takes in the 'next best' variable for building out the random forest:

# Define the variables to be considered
possVars <- c("TempF", "DewF", "month", "hr", "minHeight", 
              "ceilingHeight", "WindSpeed", "predomDir", "modSLP"
              )

# Create a container for storing the best random forest and variable added from each run
rfContainer <- vector("list", length=length(possVars))
rfVarAdds <- vector("character", length=length(possVars))

# Run the functions using a for loop over the length of possVars
for (ctr in 1:length(possVars)) {
    
    # Pull in the results of the previous run
    if (ctr==1) {
        prevVars <- c()
    } else {
        prevVars <- rfVarAdds[1:(ctr-1)]
    }
    
    # Run each of them through the combinations
    tmpList <- helperRFCombinations(possVars, df=sub_2015_2018_data, prevVars=prevVars)

    # Assess the performance
    tmpAccuracy <- helperVariableAccuracy(tmpList, possVars=possVars, prevVars=prevVars)
    
    # Prepare to repeat the process
    bestRow <- tmpAccuracy %>%
        filter(locale=="OOB") %>%
        filter(accuracy==max(accuracy))
    
    # Update the rfContainer and rfVarAdds elements
    rfContainer[[ctr]] <- tmpList[[as.integer(pull(bestRow, vrblNum))]]
    rfVarAdds[ctr] <- pull(bestRow, vrblName)
    cat("\nVariable Added:", rfVarAdds[ctr], "\n")

}
## 
## Will run for: TempF DewF month hr minHeight ceilingHeight WindSpeed predomDir modSLP
## Fixed variable(s) will be:  
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"
## 
## Variable Added: DewF 
## 
## Will run for: TempF month hr minHeight ceilingHeight WindSpeed predomDir modSLP
## Fixed variable(s) will be: DewF 
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"
## 
## Variable Added: TempF 
## 
## Will run for: month hr minHeight ceilingHeight WindSpeed predomDir modSLP
## Fixed variable(s) will be: DewF TempF 
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"
## 
## Variable Added: month 
## 
## Will run for: hr minHeight ceilingHeight WindSpeed predomDir modSLP
## Fixed variable(s) will be: DewF TempF month 
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"
## 
## Variable Added: modSLP 
## 
## Will run for: hr minHeight ceilingHeight WindSpeed predomDir
## Fixed variable(s) will be: DewF TempF month modSLP 
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"
## 
## Variable Added: predomDir 
## 
## Will run for: hr minHeight ceilingHeight WindSpeed
## Fixed variable(s) will be: DewF TempF month modSLP predomDir 
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"
## 
## Variable Added: minHeight 
## 
## Will run for: hr ceilingHeight WindSpeed
## Fixed variable(s) will be: DewF TempF month modSLP predomDir minHeight 
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"
## 
## Variable Added: hr 
## 
## Will run for: ceilingHeight WindSpeed
## Fixed variable(s) will be: DewF TempF month modSLP predomDir minHeight hr 
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"
## 
## Variable Added: WindSpeed 
## 
## Will run for: ceilingHeight
## Fixed variable(s) will be: DewF TempF month modSLP predomDir minHeight hr WindSpeed 
## 
## Running for locations:
## [1] "Chicago, IL"     "Las Vegas, NV"   "New Orleans, LA" "San Diego, CA"
## 
## Variable Added: ceilingHeight

The evolution of accuracy can then be plotted:

# Pull the accuracy data from the variables selected
tblAccuracy <- map_dfr(rfContainer, .f=helperExtractAccuracy, .id="vrblNum") %>%
    mutate(vrblName=rfVarAdds[as.integer(vrblNum)], 
           locale=factor(locale, levels=c("OOB", unique(locale)[unique(locale) != "OOB"])), 
           plotLabel=ifelse(as.integer(vrblNum)==1, vrblName, paste0("add ", vrblName))
           )

# Plot the gains in accuracy, facetted by 'locale'
ggplot(tblAccuracy, aes(x=fct_reorder(plotLabel, -as.integer(vrblNum)))) + 
    geom_text(aes(y=accuracy, label=paste0(round(100*accuracy), "%"))) + 
    facet_wrap(~locale, nrow=1) + 
    ylim(c(0, 1)) +
    geom_hline(yintercept=0.25, lty=2) +
    labs(x="", 
         y="", 
         title="Evolution of accuracy as next-best variables are added",
         caption="Null accuracy is 25%"
         ) +
    coord_flip()

As seen previously, dew point, temperature, and month are significant differentiating variables, driving roughly 75% accuracy in classifying the archetypes.

Adding modified sea-level pressure (modSLP) bumps the accuracy by another ~10%, while adding minimum cloud height and prevailing wind direction bump the accuracy by another ~5%.

This shows some of the power of the random forest algorithm as it is given additional variables to explore. Evolution of error can be plotted to see the impact:

oobError <- c()
for (ctr in 1:9) {
    oobError <- c(oobError, rfContainer[[ctr]]$rfModel$err.rate[, "OOB"])
}

tibble::tibble(nVars=rep(1:9, each=25), 
               ntrees=rep(1:25, times=9), 
               accuracy=1-oobError
               ) %>%
    ggplot(aes(x=ntrees, y=accuracy)) + 
    geom_line(aes(group=nVars, color=factor(nVars))) +
    labs(x="Number of Trees", 
         y="OOB Accuracy", 
         title="Accuracy improves with more trees and more variables"
         ) + 
    scale_color_discrete("# Variables")

With a greater number of variables, there is a greater lift in accuracy as the forest grows larger.

Suppose that an attempt is made to classify the year for a single city, starting with Chicago:

# Create a subset of the data for only Chicago (2014-2019)
sub_kord_data <- metarData %>%
    mutate(hr=lubridate::hour(dtime)) %>%
    filter(str_sub(source, 1, 4) %in% "kord")

# Run random forest for Chicago 2014-2019 data
rf_kord_TDmcwha <- rfMultiLocale(sub_kord_data, 
                                 vrbls=c("TempF", "DewF", 
                                         "month", "hr",
                                         "minHeight", "ceilingHeight", 
                                         "WindSpeed", "predomDir", 
                                         "modSLP"
                                         ),
                                 locs=NULL, 
                                 locVar="year",
                                 pred="year",
                                 ntree=200, 
                                 seed=2007011309, 
                                 mtry=4
                                 )
## 
## Running for locations:
## [1] 2014 2015 2016 2017 2018 2019

Prediction accuracy and variable importance can then be assessed:

# Evaluate prediction accuracy
evalPredictions(rf_kord_TDmcwha, 
                plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, SLP", 
                keyVar="year", 
                locOrder=TRUE
                )

## # A tibble: 36 x 5
##    locale predicted correct     n    pct
##    <fct>  <fct>     <lgl>   <int>  <dbl>
##  1 2014   2014      TRUE     2016 0.798 
##  2 2014   2015      FALSE     136 0.0539
##  3 2014   2016      FALSE      85 0.0337
##  4 2014   2017      FALSE     101 0.04  
##  5 2014   2018      FALSE     102 0.0404
##  6 2014   2019      FALSE      85 0.0337
##  7 2015   2014      FALSE     126 0.048 
##  8 2015   2015      TRUE     2033 0.774 
##  9 2015   2016      FALSE     145 0.0552
## 10 2015   2017      FALSE     103 0.0392
## # ... with 26 more rows
# Evaluate variable importance
helperPlotVarImp(rf_kord_TDmcwha$rfModel)

# Evaluate error evolution
errorEvolution(rf_kord_TDmcwha, useCategory="OOB", subT="Chicago (2014-2019)")

## # A tibble: 200 x 3
##    ntree Category Error
##    <int> <chr>    <dbl>
##  1     1 OOB      0.491
##  2     2 OOB      0.488
##  3     3 OOB      0.489
##  4     4 OOB      0.476
##  5     5 OOB      0.469
##  6     6 OOB      0.454
##  7     7 OOB      0.442
##  8     8 OOB      0.428
##  9     9 OOB      0.418
## 10    10 OOB      0.408
## # ... with 190 more rows

Interestingly, the model predicts the year with 75%-80% accuracy (null accuracy ~17%), suggesting there is significant annual variation in the Chicago climate. Sea-level pressure has the largest variable importance (SLP reflects a mix of altitude, temperature, dew point, and high/low pressure systems). The first 50 trees significantly reduce the OOB error, with more modest, but still continuing, reduction afterwards.

The process is run for Las Vegas:

# Create a subset of the data for only Las Vegas (2014-2019)
sub_klas_data <- metarData %>%
    mutate(hr=lubridate::hour(dtime)) %>%
    filter(str_sub(source, 1, 4) %in% "klas")

# Run random forest for Las Vegas 2014-2019 data
rf_klas_TDmcwha <- rfMultiLocale(sub_klas_data, 
                                 vrbls=c("TempF", "DewF", 
                                         "month", "hr",
                                         "minHeight", "ceilingHeight", 
                                         "WindSpeed", "predomDir", 
                                         "modSLP"
                                         ),
                                 locs=NULL, 
                                 locVar="year",
                                 pred="year",
                                 ntree=200, 
                                 seed=2007011405, 
                                 mtry=4
                                 )
## 
## Running for locations:
## [1] 2014 2015 2016 2017 2018 2019

Prediction accuracy and variable importance can then be assessed:

# Evaluate prediction accuracy
evalPredictions(rf_klas_TDmcwha, 
                plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, SLP", 
                keyVar="year", 
                locOrder=TRUE
                )

## # A tibble: 36 x 5
##    locale predicted correct     n    pct
##    <fct>  <fct>     <lgl>   <int>  <dbl>
##  1 2014   2014      TRUE     1885 0.730 
##  2 2014   2015      FALSE     175 0.0678
##  3 2014   2016      FALSE     142 0.0550
##  4 2014   2017      FALSE     149 0.0577
##  5 2014   2018      FALSE     109 0.0422
##  6 2014   2019      FALSE     121 0.0469
##  7 2015   2014      FALSE     197 0.0757
##  8 2015   2015      TRUE     1711 0.658 
##  9 2015   2016      FALSE     162 0.0623
## 10 2015   2017      FALSE     169 0.0650
## # ... with 26 more rows
# Evaluate variable importance
helperPlotVarImp(rf_klas_TDmcwha$rfModel)

# Evaluate error evolution
errorEvolution(rf_klas_TDmcwha, useCategory="OOB", subT="Las Vegas (2014-2019)")

## # A tibble: 200 x 3
##    ntree Category Error
##    <int> <chr>    <dbl>
##  1     1 OOB      0.576
##  2     2 OOB      0.569
##  3     3 OOB      0.565
##  4     4 OOB      0.560
##  5     5 OOB      0.554
##  6     6 OOB      0.548
##  7     7 OOB      0.536
##  8     8 OOB      0.527
##  9     9 OOB      0.518
## 10    10 OOB      0.511
## # ... with 190 more rows

Prediction accuracies are lower for Las Vegas, averaging 60%-75% (2014 appears to be much more differentiated than the other years). This suggests that there is less annual variation in the Las Vegas climate than in the Chicago climate, though still enough to meaningfully pull apart the years.

Modified sea-level pressure is the top predictor, and the majority of the OOB error reduction occurs in the first 50 trees.
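
The "first 50 trees" observation can be quantified. As a sketch (using a synthetic decay curve standing in for `rfModel$err.rate[, "OOB"]`, the per-tree OOB error column that `randomForest` stores for a classification fit), compute the share of the total OOB error reduction achieved by tree 50:

```r
# Sketch: how much of the total OOB error reduction happens in the first
# 50 of 200 trees? 'err' is a synthetic exponential-decay curve, not the
# fitted model's actual error trace.
err <- 0.3 + 0.3 * exp(-(1:200) / 30)
frac50 <- (err[1] - err[50]) / (err[1] - err[200])
round(frac50, 2)
```

On the real models, substituting `rf_klas_TDmcwha$rfModel$err.rate[, "OOB"]` for `err` gives the actual fraction.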

Suppose that modSLP is eliminated as a predictor variable, and the process is scaled down to 50 trees and repeated for Chicago:

# Run random forest for Chicago 2014-2019 data
rf_kord_TDmcwh_50 <- rfMultiLocale(sub_kord_data, 
                                   vrbls=c("TempF", "DewF", 
                                           "month", "hr",
                                           "minHeight", "ceilingHeight", 
                                           "WindSpeed", "predomDir"
                                           ),
                                   locs=NULL, 
                                   locVar="year",
                                   pred="year",
                                   ntree=50, 
                                   seed=2007011410, 
                                   mtry=4
                                   )
## 
## Running for locations:
## [1] 2014 2015 2016 2017 2018 2019

Prediction accuracy and variable importance can then be assessed:

# Evaluate prediction accuracy
evalPredictions(rf_kord_TDmcwh_50, 
                plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind", 
                keyVar="year", 
                locOrder=TRUE
                )

## # A tibble: 36 x 5
##    locale predicted correct     n    pct
##    <fct>  <fct>     <lgl>   <int>  <dbl>
##  1 2014   2014      TRUE     1621 0.629 
##  2 2014   2015      FALSE     194 0.0753
##  3 2014   2016      FALSE     175 0.0679
##  4 2014   2017      FALSE     217 0.0842
##  5 2014   2018      FALSE     197 0.0764
##  6 2014   2019      FALSE     174 0.0675
##  7 2015   2014      FALSE     240 0.0901
##  8 2015   2015      TRUE     1554 0.583 
##  9 2015   2016      FALSE     238 0.0893
## 10 2015   2017      FALSE     196 0.0736
## # ... with 26 more rows
# Evaluate variable importance
helperPlotVarImp(rf_kord_TDmcwh_50$rfModel)

# Compare error evolution through the first 50 trees
tibble::tibble(ntree=1:50, 
               withSLP=rf_kord_TDmcwha$rfModel$err.rate[1:50, "OOB"], 
               noSLP=rf_kord_TDmcwh_50$rfModel$err.rate[1:50, "OOB"]
               ) %>%
    pivot_longer(-ntree, names_to="Model", values_to="Error") %>%
    ggplot(aes(x=ntree, y=Error, color=Model)) + 
    geom_line() + 
    geom_text(data=~filter(., ntree %in% c(1, 50)), 
              aes(y=Error+0.02, label=paste0(round(100*Error, 1), "%"))
              ) +
    labs(x="# Trees", 
         y="OOB Error Rate", 
         title="OOB Error Evolution for SLP Included/Excluded"
         ) + 
    ylim(c(0, NA))

Modified sea-level pressure is revealed to be a significant driver of prediction accuracy for Chicago, confirming the findings of the earlier variable importance analysis. This suggests that year-to-year differences in which high- and low-pressure patterns dominate meaningfully improve the ability to identify the year in the Chicago data.

The process can also be run for Las Vegas:

# Run random forest for Las Vegas 2014-2019 data
rf_klas_TDmcwh_50 <- rfMultiLocale(sub_klas_data, 
                                   vrbls=c("TempF", "DewF", 
                                           "month", "hr",
                                           "minHeight", "ceilingHeight", 
                                           "WindSpeed", "predomDir"
                                           ),
                                   locs=NULL, 
                                   locVar="year",
                                   pred="year",
                                   ntree=50, 
                                   seed=2007011420, 
                                   mtry=4
                                   )
## 
## Running for locations:
## [1] 2014 2015 2016 2017 2018 2019

Prediction accuracy and variable importance can then be assessed:

# Evaluate prediction accuracy
evalPredictions(rf_klas_TDmcwh_50, 
                plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind", 
                keyVar="year", 
                locOrder=TRUE
                )

## # A tibble: 36 x 5
##    locale predicted correct     n    pct
##    <fct>  <fct>     <lgl>   <int>  <dbl>
##  1 2014   2014      TRUE     1610 0.630 
##  2 2014   2015      FALSE     258 0.101 
##  3 2014   2016      FALSE     177 0.0693
##  4 2014   2017      FALSE     194 0.0759
##  5 2014   2018      FALSE     148 0.0579
##  6 2014   2019      FALSE     168 0.0658
##  7 2015   2014      FALSE     298 0.115 
##  8 2015   2015      TRUE     1338 0.516 
##  9 2015   2016      FALSE     228 0.0880
## 10 2015   2017      FALSE     227 0.0876
## # ... with 26 more rows
# Evaluate variable importance
helperPlotVarImp(rf_klas_TDmcwh_50$rfModel)

# Compare error evolution through the first 50 trees
tibble::tibble(ntree=1:50, 
               withSLP=rf_klas_TDmcwha$rfModel$err.rate[1:50, "OOB"], 
               noSLP=rf_klas_TDmcwh_50$rfModel$err.rate[1:50, "OOB"]
               ) %>%
    pivot_longer(-ntree, names_to="Model", values_to="Error") %>%
    ggplot(aes(x=ntree, y=Error, color=Model)) + 
    geom_line() + 
    geom_text(data=~filter(., ntree %in% c(1, 50)), 
              aes(y=Error+0.02, label=paste0(round(100*Error, 1), "%"))
              ) +
    labs(x="# Trees", 
         y="OOB Error Rate", 
         title="OOB Error Evolution for SLP Included/Excluded"
         ) + 
    ylim(c(0, NA))

Modified sea-level pressure is revealed as a meaningful driver of prediction accuracy for Las Vegas as well, though with a smaller effect than in Chicago. This potentially suggests less year-to-year variability in SLP for Las Vegas.

Since SLP is so meaningful, are there any patterns by year?

sub_kord_data %>%
    bind_rows(sub_klas_data, .id="cityFile") %>%
    select(cityFile, year, month, modSLP) %>%
    mutate(city=case_when(cityFile==1 ~ "Chicago", cityFile==2 ~ "Las Vegas", TRUE ~ "Error")) %>%
    filter(!is.na(modSLP)) %>%
    ggplot(aes(x=factor(year), y=modSLP)) + 
    geom_boxplot(fill="lightblue") + 
    facet_wrap(~city) +
    labs(x="", y="modSLP", title="Modified Sea-Level Pressure by Year (Chicago and Las Vegas)")

Chicago has a higher sea-level pressure than Las Vegas (as expected), though neither city shows much variation on average from year to year. The Chicago data are explored further by month:

sub_kord_data %>%
    select(year, month, modSLP) %>%
    filter(!is.na(modSLP)) %>%
    ggplot(aes(x=month, y=modSLP)) + 
    geom_boxplot(fill="lightblue") + 
    facet_wrap(~year) +
    labs(x="", y="modSLP", title="Modified Sea-Level Pressure by Month and Year (Chicago)")

sub_kord_data %>%
    select(year, month, modSLP) %>%
    filter(!is.na(modSLP)) %>%
    group_by(year, month) %>%
    summarize(medSLP=median(modSLP)) %>%
    ggplot(aes(x=month, y=medSLP, color=factor(year), group=year)) + 
    geom_line(lwd=1) + 
    labs(x="", y="Median modSLP", title="Median modified Sea-Level Pressure by Month and Year (Chicago)")

sub_kord_data %>%
    select(year, month, modSLP) %>%
    filter(!is.na(modSLP)) %>%
    group_by(year, month) %>%
    summarize(medSLP=median(modSLP)) %>%
    group_by(month) %>%
    summarize(sdSLP=sd(medSLP)) %>%
    ggplot(aes(x=month, y=sdSLP)) + 
    geom_point() + 
    labs(x="", 
         y="Standard Deviation of Median modSLP by Year", 
         title="Variability by year of modified Sea-Level Pressure by Month (Chicago)"
         )

There are meaningful differences in median modified sea-level pressure by month by year. This suggests that several years of data are needed to define an archetype; otherwise there is a risk that the archetype captures an anomalous "high pressure was in control in June" pattern that is not generally applicable. And since high vs. low pressure is often related to temperature, dew point, cloudiness, wind speed, and the like, the need for multiple years of data to define an archetype likely extends to all the other variables as well.

With greater variability in SLP by year in the cold-weather months, predictive ability in the cold-weather months is expected to be higher:

rf_kord_TDmcwha$testData %>% 
    group_by(month) %>% 
    summarize(pctError=1-mean(correct)) %>%
    ggplot(aes(x=month, y=pctError)) + 
    geom_col(fill="lightblue") + 
    geom_text(aes(y=pctError/2, label=paste0(round(100*pctError, 1), "%"))) +
    labs(x="", y="Error Rate", title="Prediction Error Rate by Month (Chicago)")

As expected, the model takes advantage of greater annual variation in modSLP during the cold season to better predict the year during those months.

Next an attempt is made to predict the month given the data:

# Run random forest for Las Vegas 2014-2019 data
rf_klas_TDycwha_month_50 <- rfMultiLocale(sub_klas_data, 
                                          vrbls=c("TempF", "DewF", 
                                                  "year", "hr",
                                                  "minHeight", "ceilingHeight", 
                                                  "WindSpeed", "predomDir", 
                                                  "modSLP"
                                                  ),
                                          locs=NULL, 
                                          locVar="month",
                                          pred="month",
                                          ntree=50, 
                                          seed=2007021356, 
                                          mtry=4
                                          )
## 
## Running for locations:
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"

Prediction accuracy and variable importance can then be assessed:

# Evaluate prediction accuracy
evalPredictions(rf_klas_TDycwha_month_50, 
                plotCaption = "Temp, Dew Point, Year, Hour of Day, Cloud Height, Wind, SLP", 
                keyVar="month", 
                locOrder=TRUE
                )

## # A tibble: 105 x 5
##    locale predicted correct     n     pct
##    <fct>  <fct>     <lgl>   <int>   <dbl>
##  1 Jan    Jan       TRUE     1087 0.809  
##  2 Jan    Feb       FALSE      50 0.0372 
##  3 Jan    Mar       FALSE      27 0.0201 
##  4 Jan    Apr       FALSE       6 0.00447
##  5 Jan    May       FALSE       5 0.00372
##  6 Jan    Nov       FALSE      50 0.0372 
##  7 Jan    Dec       FALSE     118 0.0879 
##  8 Feb    Jan       FALSE      75 0.0626 
##  9 Feb    Feb       TRUE      820 0.684  
## 10 Feb    Mar       FALSE     105 0.0876 
## # ... with 95 more rows
# Evaluate variable importance
helperPlotVarImp(rf_klas_TDycwha_month_50$rfModel)

# Evaluate error evolution
errorEvolution(rf_klas_TDycwha_month_50, useCategory="OOB", subT="Las Vegas (2014-2019)")

## # A tibble: 50 x 3
##    ntree Category Error
##    <int> <chr>    <dbl>
##  1     1 OOB      0.536
##  2     2 OOB      0.533
##  3     3 OOB      0.528
##  4     4 OOB      0.521
##  5     5 OOB      0.514
##  6     6 OOB      0.501
##  7     7 OOB      0.494
##  8     8 OOB      0.481
##  9     9 OOB      0.472
## 10    10 OOB      0.462
## # ... with 40 more rows

There is meaningful seasonality to the Las Vegas data such that the predicted month is frequently either the same season or (for spring/fall) the mirror season.
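
The "mirror season" idea can be made concrete with a small lookup. The helpers below are hypothetical (not part of `WeatherModelingFunctions_v001.R`): months map to seasons, and spring/fall are treated as mirrors of each other since their temperature and dew-point profiles are climatologically similar:

```r
# Hypothetical helpers for collapsing month-level confusion to season level.
seasonOf <- function(m) {
    c(Jan="Winter", Feb="Winter", Mar="Spring", Apr="Spring", May="Spring",
      Jun="Summer", Jul="Summer", Aug="Summer", Sep="Fall", Oct="Fall",
      Nov="Fall", Dec="Winter")[[m]]
}
mirrorOf <- c(Winter="Winter", Spring="Fall", Summer="Summer", Fall="Spring")

# A January observation predicted as December is a same-season miss
seasonOf("Jan") == seasonOf("Dec")                  # TRUE: both Winter
# An April observation predicted as October is a mirror-season miss
mirrorOf[[seasonOf("Apr")]] == seasonOf("Oct")      # TRUE: Spring mirrors Fall
```

Joining these helpers against the `locale`/`predicted` columns of the `evalPredictions` output would tally how many misses stay within the same or mirror season.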

Suppose that year is not available as a predictor:

# Run random forest for Las Vegas 2014-2019 data
rf_klas_TDcwha_month_50 <- rfMultiLocale(sub_klas_data, 
                                         vrbls=c("TempF", "DewF", 
                                                 "hr",
                                                 "minHeight", "ceilingHeight", 
                                                 "WindSpeed", "predomDir", 
                                                 "modSLP"
                                                 ),
                                         locs=NULL, 
                                         locVar="month",
                                         pred="month",
                                         ntree=50, 
                                         seed=2007021402, 
                                         mtry=4
                                         )
## 
## Running for locations:
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"

Prediction accuracy and variable importance can then be assessed:

# Evaluate prediction accuracy
evalPredictions(rf_klas_TDcwha_month_50, 
                plotCaption = "Temp, Dew Point, Hour of Day, Cloud Height, Wind, SLP", 
                keyVar="month", 
                locOrder=TRUE
                )

## # A tibble: 105 x 5
##    locale predicted correct     n     pct
##    <fct>  <fct>     <lgl>   <int>   <dbl>
##  1 Jan    Jan       TRUE      920 0.697  
##  2 Jan    Feb       FALSE      85 0.0644 
##  3 Jan    Mar       FALSE      33 0.025  
##  4 Jan    Apr       FALSE       6 0.00455
##  5 Jan    May       FALSE       5 0.00379
##  6 Jan    Nov       FALSE      70 0.0530 
##  7 Jan    Dec       FALSE     201 0.152  
##  8 Feb    Jan       FALSE     112 0.0945 
##  9 Feb    Feb       TRUE      612 0.516  
## 10 Feb    Mar       FALSE     139 0.117  
## # ... with 95 more rows
# Evaluate variable importance
helperPlotVarImp(rf_klas_TDcwha_month_50$rfModel)

# Evaluate error evolution
errorEvolution(rf_klas_TDcwha_month_50, useCategory="OOB", subT="Las Vegas (2014-2019)")

## # A tibble: 50 x 3
##    ntree Category Error
##    <int> <chr>    <dbl>
##  1     1 OOB      0.618
##  2     2 OOB      0.616
##  3     3 OOB      0.611
##  4     4 OOB      0.605
##  5     5 OOB      0.599
##  6     6 OOB      0.593
##  7     7 OOB      0.587
##  8     8 OOB      0.580
##  9     9 OOB      0.575
## 10    10 OOB      0.565
## # ... with 40 more rows

The same seasonal pattern is observed, though prediction accuracy falls by ~15%. This suggests that it is not uncommon for the Las Vegas climate to run a month ahead of or behind where it ran in a previous year.

The model appears to still be learning at 50 trees, so it is expanded to 150 trees:

# Run random forest for Las Vegas 2014-2019 data
rf_klas_TDcwha_month_150 <- rfMultiLocale(sub_klas_data, 
                                          vrbls=c("TempF", "DewF", 
                                                  "hr",
                                                  "minHeight", "ceilingHeight", 
                                                  "WindSpeed", "predomDir", 
                                                  "modSLP"
                                                  ),
                                          locs=NULL, 
                                          locVar="month",
                                          pred="month",
                                          ntree=150, 
                                          seed=2007021402, 
                                          mtry=4
                                          )
## 
## Running for locations:
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"

Prediction accuracy and variable importance can then be assessed:

# Evaluate prediction accuracy
evalPredictions(rf_klas_TDcwha_month_150, 
                plotCaption = "Temp, Dew Point, Hour of Day, Cloud Height, Wind, SLP", 
                keyVar="month", 
                locOrder=TRUE
                )

## # A tibble: 105 x 5
##    locale predicted correct     n     pct
##    <fct>  <fct>     <lgl>   <int>   <dbl>
##  1 Jan    Jan       TRUE      927 0.702  
##  2 Jan    Feb       FALSE      75 0.0568 
##  3 Jan    Mar       FALSE      38 0.0288 
##  4 Jan    Apr       FALSE       6 0.00455
##  5 Jan    May       FALSE       5 0.00379
##  6 Jan    Nov       FALSE      69 0.0523 
##  7 Jan    Dec       FALSE     200 0.152  
##  8 Feb    Jan       FALSE      91 0.0768 
##  9 Feb    Feb       TRUE      623 0.526  
## 10 Feb    Mar       FALSE     131 0.111  
## # ... with 95 more rows
# Evaluate variable importance
helperPlotVarImp(rf_klas_TDcwha_month_150$rfModel)

# Evaluate error evolution
errorEvolution(rf_klas_TDcwha_month_150, useCategory="OOB", subT="Las Vegas (2014-2019)")

## # A tibble: 150 x 3
##    ntree Category Error
##    <int> <chr>    <dbl>
##  1     1 OOB      0.618
##  2     2 OOB      0.616
##  3     3 OOB      0.611
##  4     4 OOB      0.605
##  5     5 OOB      0.599
##  6     6 OOB      0.593
##  7     7 OOB      0.587
##  8     8 OOB      0.580
##  9     9 OOB      0.575
## 10    10 OOB      0.565
## # ... with 140 more rows
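Whether a forest is "still learning" can be checked numerically rather than by eye, by comparing mean OOB error over the last few trees to the window just before it. A minimal sketch with a hypothetical helper stillLearning(), operating on an err.rate[, "OOB"] vector from a randomForest fit:

```r
# Hypothetical helper: is OOB error still declining near the end of a run?
# Compares mean error over the last `window` trees to the window before it.
stillLearning <- function(oobErr, window = 10) {
    n <- length(oobErr)
    recent <- mean(oobErr[(n - window + 1):n])
    earlier <- mean(oobErr[(n - 2 * window + 1):(n - window)])
    recent < earlier
}

# Usage on err.rate-like vectors: a steadily declining curve vs. a flat tail
stillLearning(seq(0.6, 0.3, length.out = 150))                   # TRUE
stillLearning(c(seq(0.6, 0.4, length.out = 50), rep(0.4, 100)))  # FALSE
```

For a real fit, this would be called as, e.g., stillLearning(rf_klas_TDcwha_month_150$rfModel$err.rate[, "OOB"]).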

Another topic of interest is the use of sea-level pressure versus altimeter in modeling. Atmospheric pressure in a METAR is captured in two ways:

  • Altimeter - the atmospheric pressure in inches of mercury, corrected for height above sea-level
  • Sea-Level Pressure (SLP) - the atmospheric pressure in millibars, with a theoretical hole drilled from the measuring station straight down to sea-level

In theory, both metrics report almost the same thing but in different units. There is a small difference in that SLP makes an adjustment for average 12-hour temperature (since temperature is a driver of air pressure) while altimeter is purely an adjustment for altitude.
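The unit relationship can be sanity-checked directly: 1 inHg is 33.8639 hPa, so the standard-atmosphere altimeter setting of 29.92 inHg lands very close to the standard sea-level pressure of 1013.25 hPa:

```r
# 1 inHg = 33.8639 hPa, so the two pressure fields are unit conversions
# of (nearly) the same quantity
altimeter_inHg <- 29.92                # standard-atmosphere altimeter setting
slp_hPa <- altimeter_inHg * 33.8639    # ~1013.2 hPa, near standard SLP
round(slp_hPa, 1)
```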

How does the model perform with neither/either/both of Altimeter and modSLP included? An example is constructed, trying to classify a handful of cities from 2016:

  • Chicago
  • Milwaukee
  • New Orleans
  • Las Vegas
  • San Diego
  • Denver
  • Boston
  • Seattle
sub_mini_2016_data <- metarData %>%
    mutate(hr=lubridate::hour(dtime)) %>%
    filter(year==2016, 
           str_sub(source, 1 ,4) %in% c("kord", "kmke", "kmsy", "klas", "ksan", "kden", "kbos", "ksea")
           )

First, a model is run with neither Altimeter nor SLP:

# Run random forest for subset data (2016)
rf_mini_TDmcwh <- rfMultiLocale(sub_mini_2016_data, 
                                vrbls=c("TempF", "DewF", 
                                        "month", "hr",
                                        "minHeight", "ceilingHeight", 
                                        "WindSpeed", "predomDir"
                                        ),
                                locs=NULL, 
                                locVar="locale",
                                pred="locale",
                                ntree=100, 
                                seed=2007021422, 
                                mtry=4
                                )
## 
## Running for locations:
## [1] "Boston, MA (2016)"      "Chicago, IL (2016)"     "Denver, CO (2016)"     
## [4] "Las Vegas, NV (2016)"   "Milwaukee, WI (2016)"   "New Orleans, LA (2016)"
## [7] "San Diego, CA (2016)"   "Seattle, WA (2016)"

Prediction accuracy and variable importance can then be assessed:

# Evaluate prediction accuracy
evalPredictions(rf_mini_TDmcwh, 
                plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind", 
                keyVar="locale", 
                locOrder=NULL
                )

## # A tibble: 64 x 5
##    locale             predicted              correct     n     pct
##    <fct>              <fct>                  <lgl>   <int>   <dbl>
##  1 Boston, MA (2016)  Boston, MA (2016)      TRUE     1800 0.695  
##  2 Boston, MA (2016)  Chicago, IL (2016)     FALSE     186 0.0718 
##  3 Boston, MA (2016)  Denver, CO (2016)      FALSE     158 0.0610 
##  4 Boston, MA (2016)  Las Vegas, NV (2016)   FALSE      25 0.00965
##  5 Boston, MA (2016)  Milwaukee, WI (2016)   FALSE     204 0.0788 
##  6 Boston, MA (2016)  New Orleans, LA (2016) FALSE      39 0.0151 
##  7 Boston, MA (2016)  San Diego, CA (2016)   FALSE      41 0.0158 
##  8 Boston, MA (2016)  Seattle, WA (2016)     FALSE     137 0.0529 
##  9 Chicago, IL (2016) Boston, MA (2016)      FALSE     167 0.0641 
## 10 Chicago, IL (2016) Chicago, IL (2016)     TRUE     1538 0.590  
## # ... with 54 more rows
# Evaluate variable importance
helperPlotVarImp(rf_mini_TDmcwh$rfModel)

# Evaluate error evolution
errorEvolution(rf_mini_TDmcwh, useCategory="OOB", subT="No Altimeter, No SLP")

## # A tibble: 100 x 3
##    ntree Category Error
##    <int> <chr>    <dbl>
##  1     1 OOB      0.374
##  2     2 OOB      0.373
##  3     3 OOB      0.369
##  4     4 OOB      0.363
##  5     5 OOB      0.355
##  6     6 OOB      0.345
##  7     7 OOB      0.335
##  8     8 OOB      0.328
##  9     9 OOB      0.317
## 10    10 OOB      0.309
## # ... with 90 more rows

Overall accuracy is roughly 75%. The greatest challenge is separating Chicago and Milwaukee, with Boston also showing some similarity to these cities.

The process is then run with both Altimeter and SLP:

# Run random forest for subset data (2016)
rf_mini_TDmcwhas <- rfMultiLocale(sub_mini_2016_data, 
                                  vrbls=c("TempF", "DewF", 
                                          "month", "hr",
                                          "minHeight", "ceilingHeight", 
                                          "WindSpeed", "predomDir",
                                          "modSLP", "Altimeter"
                                          ),
                                  locs=NULL, 
                                  locVar="locale",
                                  pred="locale",
                                  ntree=100, 
                                  seed=2007021429, 
                                  mtry=4
                                  )
## 
## Running for locations:
## [1] "Boston, MA (2016)"      "Chicago, IL (2016)"     "Denver, CO (2016)"     
## [4] "Las Vegas, NV (2016)"   "Milwaukee, WI (2016)"   "New Orleans, LA (2016)"
## [7] "San Diego, CA (2016)"   "Seattle, WA (2016)"

Prediction accuracy and variable importance can again be assessed:

# Evaluate prediction accuracy
evalPredictions(rf_mini_TDmcwhas, 
                plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, Altimeter, and SLP", 
                keyVar="locale", 
                locOrder=NULL
                )
## Warning: Removed 1 rows containing missing values (geom_text).

## # A tibble: 62 x 5
##    locale             predicted              correct     n     pct
##    <fct>              <fct>                  <lgl>   <int>   <dbl>
##  1 Boston, MA (2016)  Boston, MA (2016)      TRUE     2131 0.837  
##  2 Boston, MA (2016)  Chicago, IL (2016)     FALSE      90 0.0354 
##  3 Boston, MA (2016)  Denver, CO (2016)      FALSE      44 0.0173 
##  4 Boston, MA (2016)  Las Vegas, NV (2016)   FALSE      25 0.00982
##  5 Boston, MA (2016)  Milwaukee, WI (2016)   FALSE     109 0.0428 
##  6 Boston, MA (2016)  New Orleans, LA (2016) FALSE      18 0.00707
##  7 Boston, MA (2016)  San Diego, CA (2016)   FALSE      30 0.0118 
##  8 Boston, MA (2016)  Seattle, WA (2016)     FALSE      98 0.0385 
##  9 Chicago, IL (2016) Boston, MA (2016)      FALSE     112 0.0412 
## 10 Chicago, IL (2016) Chicago, IL (2016)     TRUE     1852 0.681  
## # ... with 52 more rows
# Evaluate variable importance
helperPlotVarImp(rf_mini_TDmcwhas$rfModel)

# Evaluate error evolution
errorEvolution(rf_mini_TDmcwhas, useCategory="OOB", subT="Both Altimeter and SLP")

## # A tibble: 100 x 3
##    ntree Category Error
##    <int> <chr>    <dbl>
##  1     1 OOB      0.316
##  2     2 OOB      0.319
##  3     3 OOB      0.319
##  4     4 OOB      0.310
##  5     5 OOB      0.299
##  6     6 OOB      0.291
##  7     7 OOB      0.282
##  8     8 OOB      0.271
##  9     9 OOB      0.261
## 10    10 OOB      0.251
## # ... with 90 more rows

Overall accuracy increases to about 85%. Except for Chicago/Milwaukee, every city is overwhelmingly classified as itself.

Smaller runs with just SLP and just Altimeter are conducted to verify that the learning rate is roughly the same in these cases. A run with SLP plus a dummy variable (SLP + rnorm()) is also included:

# Run random forest for subset data (2016)
rf_mini_TDmcwhs <- rfMultiLocale(sub_mini_2016_data, 
                                 vrbls=c("TempF", "DewF", 
                                         "month", "hr",
                                         "minHeight", "ceilingHeight", 
                                         "WindSpeed", "predomDir",
                                         "modSLP"
                                         ),
                                 locs=NULL, 
                                 locVar="locale",
                                 pred="locale",
                                 ntree=50, 
                                 seed=2007021435, 
                                 mtry=4
                                 )
## 
## Running for locations:
## [1] "Boston, MA (2016)"      "Chicago, IL (2016)"     "Denver, CO (2016)"     
## [4] "Las Vegas, NV (2016)"   "Milwaukee, WI (2016)"   "New Orleans, LA (2016)"
## [7] "San Diego, CA (2016)"   "Seattle, WA (2016)"
# Run random forest for subset data (2016)
rf_mini_TDmcwha <- rfMultiLocale(sub_mini_2016_data, 
                                 vrbls=c("TempF", "DewF", 
                                         "month", "hr",
                                         "minHeight", "ceilingHeight", 
                                         "WindSpeed", "predomDir",
                                         "Altimeter"
                                         ),
                                 locs=NULL, 
                                 locVar="locale",
                                 pred="locale",
                                 ntree=50, 
                                 seed=2007021440, 
                                 mtry=4
                                 )
## 
## Running for locations:
## [1] "Boston, MA (2016)"      "Chicago, IL (2016)"     "Denver, CO (2016)"     
## [4] "Las Vegas, NV (2016)"   "Milwaukee, WI (2016)"   "New Orleans, LA (2016)"
## [7] "San Diego, CA (2016)"   "Seattle, WA (2016)"
set.seed(2007021444)
subUseData <- sub_mini_2016_data %>%
    mutate(dummy=modSLP + rnorm(n=nrow(sub_mini_2016_data)))

# Run random forest for subset data (2016)
rf_mini_TDmcwhsd <- rfMultiLocale(subUseData, 
                                  vrbls=c("TempF", "DewF", 
                                          "month", "hr",
                                          "minHeight", "ceilingHeight", 
                                          "WindSpeed", "predomDir",
                                          "modSLP", "dummy"
                                          ),
                                  locs=NULL, 
                                  locVar="locale",
                                  pred="locale",
                                  ntree=50, 
                                  seed=2007021445, 
                                  mtry=4
                                  )
## 
## Running for locations:
## [1] "Boston, MA (2016)"      "Chicago, IL (2016)"     "Denver, CO (2016)"     
## [4] "Las Vegas, NV (2016)"   "Milwaukee, WI (2016)"   "New Orleans, LA (2016)"
## [7] "San Diego, CA (2016)"   "Seattle, WA (2016)"

OOB error evolution through the first 50 trees can be compared:

# Compare error evolution through the first 50 trees
tibble::tibble(ntree=1:50, 
               neither=rf_mini_TDmcwh$rfModel$err.rate[1:50, "OOB"], 
               altimeter=rf_mini_TDmcwha$rfModel$err.rate[1:50, "OOB"], 
               modslp=rf_mini_TDmcwhs$rfModel$err.rate[1:50, "OOB"],
               slp_dummy=rf_mini_TDmcwhsd$rfModel$err.rate[1:50, "OOB"], 
               both=rf_mini_TDmcwhas$rfModel$err.rate[1:50, "OOB"]
               ) %>%
    pivot_longer(-ntree, names_to="Model", values_to="Error") %>%
    ggplot(aes(x=ntree, y=Error, color=Model)) + 
    geom_line() + 
    geom_text(data=~filter(., ntree %in% c(1, 50)), 
              aes(y=Error+0.02, label=paste0(round(100*Error, 1), "%"))
              ) +
    labs(x="# Trees", 
         y="OOB Error Rate", 
         title="OOB Error Evolution for SLP Included/Excluded"
         ) + 
    ylim(c(0, NA))

Interestingly, the inclusion of both Altimeter and modSLP appears to drive a 2%-3% decrease in classification error. The main difference between the variables is that modSLP contains a hint about the previous 12-hour temperature.

Importantly, including a dummy variable with modSLP has either no impact or a slight negative impact on model performance. So, the improvement from adding Altimeter to modSLP is likely a real effect, not merely an artifact of including two highly correlated variables.
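The "highly correlated" point is easy to verify on synthetic data: adding unit-variance noise to a pressure-like series (the dummy construction used above) leaves the two series almost perfectly correlated, so the dummy carries essentially no new information:

```r
# Synthetic check: a noisy copy of a pressure-like series remains
# almost perfectly correlated with the original
set.seed(42)
slp <- rnorm(1000, mean = 1013, sd = 8)  # synthetic pressure values
dummy <- slp + rnorm(1000)               # unit-variance noise added
cor(slp, dummy)                          # ~0.99
```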

Next, an attempt is made to compare every pairing of two cities, using all variables, an mtry of 4, and a very small forest of 15 trees:

# Create a container list to hold the output
list_varimp_2016 <- vector("list", 0.5*length(names_2016)*(length(names_2016)-1))

# Set a random seed
set.seed(2007031342)

# Loop through all possible combinations
n <- 1
for (ctr in 1:(length(names_2016)-1)) {
    for (ctr2 in (ctr+1):length(names_2016)) {
        list_varimp_2016[[n]] <- rfTwoLocales(mutate(metarData, hr=lubridate::hour(dtime)), 
                                              loc1=names_2016[ctr], 
                                              loc2=names_2016[ctr2], 
                                              vrbls=c("TempF", "DewF", 
                                                      "month", "hr",
                                                      "minHeight", "ceilingHeight", 
                                                      "WindSpeed", "predomDir", 
                                                      "modSLP", "Altimeter"
                                                      ),
                                              ntree=15, 
                                              mtry=4
                                              )
        n <- n + 1
        if ((n %% 40) == 0) { cat("Through number:", n, "\n")}
    }
}
## Through number: 40 
## Through number: 80 
## Through number: 120 
## Through number: 160 
## Through number: 200 
## Through number: 240 
## Through number: 280 
## Through number: 320 
## Through number: 360 
## Through number: 400
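The nested loops above can also be expressed with combn(), which enumerates the same unordered pairs; for the 30 cities of 2016 this gives choose(30, 2) = 435 pairs, consistent with the counter above reaching 400. A minimal sketch with a stand-in name vector:

```r
# combn() enumerates the same unordered pairs as the nested loops;
# names_demo is a stand-in for names_2016
names_demo <- c("kord", "kmke", "kmsy", "klas")
pairs <- combn(names_demo, 2)   # 2 x choose(4, 2) matrix of pairs
ncol(pairs)                     # 6
choose(30, 2)                   # 435 pairs for the full 30-city set
```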
# Create a tibble from the underlying accuracy data
acc_varimp_2016 <- map_dfr(list_varimp_2016, .f=helperAccuracyLocale)

# Assess the top 20 classification accuracies
acc_varimp_2016 %>%
    arrange(-accOverall) %>%
    head(20)
## # A tibble: 20 x 5
##    locale1                locale2               accOverall accLocale1 accLocale2
##    <chr>                  <chr>                      <dbl>      <dbl>      <dbl>
##  1 Denver, CO (2016)      Miami, FL (2016)           0.998      0.998      0.998
##  2 Denver, CO (2016)      Tampa Bay, FL (2016)       0.996      0.998      0.995
##  3 Las Vegas, NV (2016)   Miami, FL (2016)           0.995      0.995      0.995
##  4 Denver, CO (2016)      New Orleans, LA (201~      0.995      0.996      0.994
##  5 Denver, CO (2016)      San Diego, CA (2016)       0.994      0.994      0.994
##  6 Denver, CO (2016)      Houston, TX (2016)         0.993      0.994      0.992
##  7 Denver, CO (2016)      San Francisco, CA (2~      0.993      0.994      0.992
##  8 Miami, FL (2016)       Seattle, WA (2016)         0.992      0.990      0.994
##  9 Miami, FL (2016)       Phoenix, AZ (2016)         0.992      0.991      0.992
## 10 Miami, FL (2016)       Traverse City, MI (2~      0.991      0.993      0.990
## 11 Miami, FL (2016)       Minneapolis, MN (201~      0.991      0.991      0.990
## 12 Boston, MA (2016)      Miami, FL (2016)           0.991      0.990      0.991
## 13 Denver, CO (2016)      Los Angeles, CA (201~      0.991      0.993      0.989
## 14 Denver, CO (2016)      San Jose, CA (2016)        0.990      0.991      0.989
## 15 Denver, CO (2016)      San Antonio, TX (201~      0.990      0.992      0.988
## 16 Grand Rapids, MI (201~ Miami, FL (2016)           0.990      0.990      0.990
## 17 Green Bay, WI (2016)   Miami, FL (2016)           0.990      0.989      0.990
## 18 Miami, FL (2016)       Milwaukee, WI (2016)       0.989      0.989      0.989
## 19 Madison, WI (2016)     Miami, FL (2016)           0.989      0.986      0.992
## 20 Las Vegas, NV (2016)   Tampa Bay, FL (2016)       0.989      0.990      0.987
# Assess the bottom 20 classification accuracies
acc_varimp_2016 %>%
    arrange(accOverall) %>%
    head(20)
## # A tibble: 20 x 5
##    locale1                locale2               accOverall accLocale1 accLocale2
##    <chr>                  <chr>                      <dbl>      <dbl>      <dbl>
##  1 Chicago, IL (2016)     Milwaukee, WI (2016)       0.675      0.685      0.666
##  2 Newark, NJ (2016)      Philadelphia, PA (20~      0.676      0.687      0.665
##  3 Detroit, MI (2016)     Grand Rapids, MI (20~      0.712      0.718      0.706
##  4 Madison, WI (2016)     Milwaukee, WI (2016)       0.718      0.725      0.711
##  5 Philadelphia, PA (201~ Washington, DC (2016)      0.730      0.741      0.718
##  6 Chicago, IL (2016)     Grand Rapids, MI (20~      0.731      0.750      0.711
##  7 Green Bay, WI (2016)   Madison, WI (2016)         0.737      0.747      0.728
##  8 Chicago, IL (2016)     Detroit, MI (2016)         0.740      0.746      0.735
##  9 Chicago, IL (2016)     Madison, WI (2016)         0.745      0.756      0.733
## 10 Grand Rapids, MI (201~ Milwaukee, WI (2016)       0.746      0.747      0.744
## 11 Green Bay, WI (2016)   Milwaukee, WI (2016)       0.754      0.757      0.751
## 12 Madison, WI (2016)     Minneapolis, MN (201~      0.758      0.759      0.757
## 13 Grand Rapids, MI (201~ Madison, WI (2016)         0.759      0.763      0.755
## 14 Detroit, MI (2016)     Milwaukee, WI (2016)       0.765      0.777      0.753
## 15 Grand Rapids, MI (201~ Traverse City, MI (2~      0.769      0.770      0.768
## 16 Detroit, MI (2016)     Indianapolis, IN (20~      0.772      0.775      0.770
## 17 Chicago, IL (2016)     Indianapolis, IN (20~      0.775      0.777      0.774
## 18 Boston, MA (2016)      Newark, NJ (2016)          0.778      0.781      0.776
## 19 Indianapolis, IN (201~ Saint Louis, MO (201~      0.779      0.791      0.768
## 20 Madison, WI (2016)     Traverse City, MI (2~      0.783      0.782      0.784

As in previous runs, the best accuracies are obtained when comparing cities in very different climates (e.g., Denver vs. Humid/Marine or Miami vs. Desert/Cold), while the worst accuracies are obtained when comparing very similar cities (e.g., Chicago and Milwaukee, or Newark and Philadelphia).

Variable importance can then be assessed across all 1:1 classifications:

# Create a tibble of all the variable importance data
val_varimp_2016 <- map_dfr(list_varimp_2016, 
                           .f=function(x) { x$rfModel %>% 
                                   caret::varImp() %>% 
                                   t() %>% 
                                   as.data.frame()
                               }
                           ) %>% 
    tibble::as_tibble()
# Create boxplot of overall variable importance
val_varimp_2016 %>% 
    mutate(num=1:nrow(val_varimp_2016)) %>% 
    pivot_longer(-num, names_to="variable", values_to="varImp") %>% 
    ggplot(aes(x=fct_reorder(variable, varImp), y=varImp)) + 
    geom_boxplot(fill="lightblue") + 
    labs(x="", 
         y="Variable Importance", 
         title="Variable Importance for 1:1 Random Forest Classifications"
         )

# Attach the city names and OOB error rate
tbl_varimp_2016 <- sapply(list_varimp_2016, 
                          FUN=function(x) { c(names(x$errorRate[2:3]), x$errorRate["OOB"]) }
                          ) %>%
    t() %>% 
    as.data.frame() %>% 
    bind_cols(val_varimp_2016) %>% 
    tibble::as_tibble() %>% 
    mutate(OOB=as.numeric(as.character(OOB))) %>%
    rename(locale1=V1, 
           locale2=V2
           )

# Plot accuracy vs. spikiness of variable importance
tbl_varimp_2016 %>%
    pivot_longer(-c(locale1, locale2, OOB), names_to="var", values_to="varImp") %>% 
    group_by(locale1, locale2, OOB) %>% 
    summarize(mean=mean(varImp), max=max(varImp)) %>% 
    mutate(maxMean=max/mean) %>%
    ggplot(aes(x=maxMean, y=1-OOB)) + 
    geom_point() + 
    geom_smooth(method="loess") +
    labs(x="Ratio of Maximum Variable Importance to Mean Variable Importance", 
         y="OOB Accuracy", 
         title="Accuracy vs. Spikiness of Variable Importance"
         )

Broadly speaking, the same variables that drive overall classification are important in driving 1:1 classifications. There is meaningful spikiness, suggesting that different 1:1 classifications rely on different variables.

There is a strong trend where the best accuracies are obtained when a single spiky dimension drives the classification. This suggests that while the model can take advantage of all 10 variables, it has the easiest time when there is a single, well-differentiated variable. No surprise.

An attempt is made to define archetype cities as cities that are either 1) very different from other cities, or 2) very similar to other cities. An archetype should always meet criterion 1 for some meaningful set of comparison cities, and criterion 2 allows an archetype to be similar to several other cities of the same archetype:

oobData <- tibble::tibble(City=character(0), OOBLow=numeric(0), OOBHigh=numeric(0))

for (city in cityNameMapper[names_2016]) {
    
    cityData <- tbl_varimp_2016 %>%
        filter(locale1 == city | locale2 == city)
    
    lowOOB <- cityData %>%
        top_n(n=10, wt=-OOB) %>%
        pull(OOB) %>%
        mean()
    
    highOOB <- cityData %>%
        top_n(n=5, wt=OOB) %>%
        pull(OOB) %>%
        mean()
    
    cat("\nCity:", city, 
        "\tOOB (Low 10 Average):", round(lowOOB, 4), 
        "\tOOB (High 5 Average):", round(highOOB, 4)
        )
    
    oobData <- bind_rows(oobData, tibble::tibble(City=city, OOBLow=lowOOB, OOBHigh=highOOB))
    
}
## 
## City: Atlanta, GA (2016)     OOB (Low 10 Average): 0.0435    OOB (High 5 Average): 0.1349
## City: Boston, MA (2016)  OOB (Low 10 Average): 0.0257    OOB (High 5 Average): 0.1907
## City: Washington, DC (2016)  OOB (Low 10 Average): 0.0398    OOB (High 5 Average): 0.1903
## City: Denver, CO (2016)  OOB (Low 10 Average): 0.0076    OOB (High 5 Average): 0.0609
## City: Dallas, TX (2016)  OOB (Low 10 Average): 0.0409    OOB (High 5 Average): 0.1416
## City: Detroit, MI (2016)     OOB (Low 10 Average): 0.0273    OOB (High 5 Average): 0.2446
## City: Newark, NJ (2016)  OOB (Low 10 Average): 0.0366    OOB (High 5 Average): 0.2226
## City: Green Bay, WI (2016)   OOB (Low 10 Average): 0.0202    OOB (High 5 Average): 0.2269
## City: Grand Rapids, MI (2016)    OOB (Low 10 Average): 0.0238    OOB (High 5 Average): 0.2567
## City: Houston, TX (2016)     OOB (Low 10 Average): 0.0232    OOB (High 5 Average): 0.1403
## City: Indianapolis, IN (2016)    OOB (Low 10 Average): 0.0356    OOB (High 5 Average): 0.2127
## City: Las Vegas, NV (2016)   OOB (Low 10 Average): 0.0162    OOB (High 5 Average): 0.0595
## City: Los Angeles, CA (2016)     OOB (Low 10 Average): 0.0295    OOB (High 5 Average): 0.0943
## City: Lincoln, NE (2016)     OOB (Low 10 Average): 0.0303    OOB (High 5 Average): 0.1682
## City: Miami, FL (2016)   OOB (Low 10 Average): 0.0082    OOB (High 5 Average): 0.1027
## City: Milwaukee, WI (2016)   OOB (Low 10 Average): 0.0228    OOB (High 5 Average): 0.2685
## City: Madison, WI (2016)     OOB (Low 10 Average): 0.0243    OOB (High 5 Average): 0.2567
## City: Minneapolis, MN (2016)     OOB (Low 10 Average): 0.022     OOB (High 5 Average): 0.205
## City: New Orleans, LA (2016)     OOB (Low 10 Average): 0.0164    OOB (High 5 Average): 0.1368
## City: Chicago, IL (2016)     OOB (Low 10 Average): 0.0275    OOB (High 5 Average): 0.2667
## City: Philadelphia, PA (2016)    OOB (Low 10 Average): 0.0398    OOB (High 5 Average): 0.2211
## City: Phoenix, AZ (2016)     OOB (Low 10 Average): 0.0158    OOB (High 5 Average): 0.0647
## City: San Diego, CA (2016)   OOB (Low 10 Average): 0.0226    OOB (High 5 Average): 0.0843
## City: San Antonio, TX (2016)     OOB (Low 10 Average): 0.0265    OOB (High 5 Average): 0.1022
## City: Seattle, WA (2016)     OOB (Low 10 Average): 0.021     OOB (High 5 Average): 0.0756
## City: San Francisco, CA (2016)   OOB (Low 10 Average): 0.0244    OOB (High 5 Average): 0.0895
## City: San Jose, CA (2016)    OOB (Low 10 Average): 0.0334    OOB (High 5 Average): 0.1003
## City: Saint Louis, MO (2016)     OOB (Low 10 Average): 0.0396    OOB (High 5 Average): 0.1737
## City: Tampa Bay, FL (2016)   OOB (Low 10 Average): 0.0136    OOB (High 5 Average): 0.1332
## City: Traverse City, MI (2016)   OOB (Low 10 Average): 0.0221    OOB (High 5 Average): 0.2158
# Plot OOB data summary
oobData %>%
    ggplot(aes(x=OOBLow, y=OOBHigh)) + 
    geom_text(aes(label=City)) + 
    labs(x="Average OOB Error of Best 10 1:1 Predictions", 
         y="Average OOB Error of Worst 5 1:1 Predictions"
         )

At a glance, Miami, Denver, and Phoenix/Las Vegas make for great archetypes in this data. They are broadly dissimilar from a large number of cities and not especially similar to many others. New Orleans or Tampa may also make for good archetypes (somewhat surprising, given that these should be similar to Miami).

Suppose that hierarchical clustering is attempted, using 1-OOB as the main ‘distance’ variable:

# Create both halves of what will become the distance matrix
upperHalf <- tbl_varimp_2016 %>%
    mutate(locale1=as.character(locale1), locale2=as.character(locale2))

lowerHalf <- upperHalf %>%
    rename(locale2=locale1, locale1=locale2) %>%
    select(locale1, locale2, everything())

bothHalf <- upperHalf %>%
    bind_rows(lowerHalf)

# Select only locale1, locale2, OOB and create a record for each locale to itself
mtxData <- bothHalf %>%
    select(locale1, locale2, OOB) %>%
    bind_rows(tibble::tibble(locale1=unique(bothHalf$locale1), locale2=locale1, OOB=NA)) %>%
    arrange(locale1, locale2) %>%
    mutate(accuracy=1-OOB)

# Create the matrix and add row/column names
mtxOOB <- matrix(data=mtxData$accuracy, nrow=sqrt(nrow(mtxData)), ncol=sqrt(nrow(mtxData)), byrow=TRUE)
dimnames(mtxOOB) <- list(unique(mtxData$locale1), unique(mtxData$locale1))

# Convert to distance matrix
distOOB <- as.dist(mtxOOB)

# Run hierarchical clustering
distOOB %>%
    hclust(method="complete") %>%
    plot()

Seattle and Denver appear to be good stand-alone cities. They are not particularly close to any other cities.

Suppose that an attempt is made to classify cities as Seattle, Denver, or Other:

# Create data for Denver/Seattle/Other
sub_densea_data <- metarData %>%
    mutate(place=case_when(source=="kden_2016" ~ "Denver", 
                           source=="ksea_2016" ~ "Seattle", 
                           TRUE ~ "Other"
                           )
           ) %>%
    filter(year==2016)

# Down-sample all to smallest 'place'
nSmallPlace <- sub_densea_data %>%
    count(place) %>%
    pull(n) %>%
    min()

# Group the data and sample to only this minimum size
set.seed(2007031504)

sub_densea_data <- sub_densea_data %>%
    group_by(place) %>%
    sample_n(size=nSmallPlace) %>%
    ungroup() %>%
    mutate(hr=lubridate::hour(dtime))

# Run random forest for subset data
rf_densea_TDmcwhsa <- rfMultiLocale(sub_densea_data, 
                                    vrbls=c("TempF", "DewF", 
                                            "month", "hr",
                                            "minHeight", "ceilingHeight", 
                                            "WindSpeed", "predomDir",
                                            "modSLP", "Altimeter"
                                            ),
                                    locs=NULL, 
                                    locVar="place",
                                    pred="place",
                                    ntree=100, 
                                    seed=2007031510, 
                                    mtry=4
                                    )
## 
## Running for locations:
## [1] "Denver"  "Other"   "Seattle"
# Evaluate prediction accuracy
evalPredictions(rf_densea_TDmcwhsa, 
                plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, Altimeter, and SLP", 
                keyVar="place", 
                locOrder=NULL
                )
## Warning: Removed 1 rows containing missing values (geom_text).

## # A tibble: 9 x 5
##   locale  predicted correct     n     pct
##   <fct>   <fct>     <lgl>   <int>   <dbl>
## 1 Denver  Denver    TRUE     2522 0.960  
## 2 Denver  Other     FALSE      83 0.0316 
## 3 Denver  Seattle   FALSE      21 0.00800
## 4 Other   Denver    FALSE     180 0.0685 
## 5 Other   Other     TRUE     2182 0.831  
## 6 Other   Seattle   FALSE     265 0.101  
## 7 Seattle Denver    FALSE      22 0.00843
## 8 Seattle Other     FALSE     115 0.0441 
## 9 Seattle Seattle   TRUE     2472 0.947
# Evaluate variable importance
helperPlotVarImp(rf_densea_TDmcwhsa$rfModel)

# Evaluate error evolution
errorEvolution(rf_densea_TDmcwhsa, subT="Denver, Seattle, Other")

## # A tibble: 400 x 3
##    ntree Category Error
##    <int> <chr>    <dbl>
##  1     1 OOB      0.180
##  2     1 Denver   0.115
##  3     1 Other    0.264
##  4     1 Seattle  0.162
##  5     2 OOB      0.191
##  6     2 Denver   0.135
##  7     2 Other    0.268
##  8     2 Seattle  0.169
##  9     3 OOB      0.189
## 10     3 Denver   0.122
## # ... with 390 more rows

The model is very good at picking out Seattle and Denver, but notably weaker at recognizing that Other is neither Seattle nor Denver. A different approach may be needed to find stand-alone cities and archetypes.

An initial attempt is made to classify all of the 2016 cities, using a small forest of 25 trees, with the objective of finding cities that are frequently classified as each other:

# Run random forest for subset data
rf_all_nt25_TDmcwhsa <- rfMultiLocale(filter(mutate(metarData, hr=lubridate::hour(dtime)), year==2016), 
                                      vrbls=c("TempF", "DewF", 
                                              "month", "hr",
                                              "minHeight", "ceilingHeight", 
                                              "WindSpeed", "predomDir",
                                              "modSLP", "Altimeter"
                                              ),
                                      locs=NULL, 
                                      locVar="locale",
                                      pred="locale",
                                      ntree=25, 
                                      seed=2007041352, 
                                      mtry=4
                                      )
## 
## Running for locations:
##  [1] "Atlanta, GA (2016)"       "Boston, MA (2016)"       
##  [3] "Chicago, IL (2016)"       "Dallas, TX (2016)"       
##  [5] "Denver, CO (2016)"        "Detroit, MI (2016)"      
##  [7] "Grand Rapids, MI (2016)"  "Green Bay, WI (2016)"    
##  [9] "Houston, TX (2016)"       "Indianapolis, IN (2016)" 
## [11] "Las Vegas, NV (2016)"     "Lincoln, NE (2016)"      
## [13] "Los Angeles, CA (2016)"   "Madison, WI (2016)"      
## [15] "Miami, FL (2016)"         "Milwaukee, WI (2016)"    
## [17] "Minneapolis, MN (2016)"   "New Orleans, LA (2016)"  
## [19] "Newark, NJ (2016)"        "Philadelphia, PA (2016)" 
## [21] "Phoenix, AZ (2016)"       "Saint Louis, MO (2016)"  
## [23] "San Antonio, TX (2016)"   "San Diego, CA (2016)"    
## [25] "San Francisco, CA (2016)" "San Jose, CA (2016)"     
## [27] "Seattle, WA (2016)"       "Tampa Bay, FL (2016)"    
## [29] "Traverse City, MI (2016)" "Washington, DC (2016)"

Accuracy can then be assessed:

# Evaluate prediction accuracy
evalPredictions(rf_all_nt25_TDmcwhsa, 
                plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, Altimeter, and SLP", 
                keyVar="locale", 
                locOrder=NULL
                )

## # A tibble: 863 x 5
##    locale             predicted               correct     n     pct
##    <fct>              <fct>                   <lgl>   <int>   <dbl>
##  1 Atlanta, GA (2016) Atlanta, GA (2016)      TRUE     1776 0.663  
##  2 Atlanta, GA (2016) Boston, MA (2016)       FALSE      12 0.00448
##  3 Atlanta, GA (2016) Chicago, IL (2016)      FALSE      12 0.00448
##  4 Atlanta, GA (2016) Dallas, TX (2016)       FALSE      91 0.0340 
##  5 Atlanta, GA (2016) Denver, CO (2016)       FALSE      10 0.00373
##  6 Atlanta, GA (2016) Detroit, MI (2016)      FALSE      20 0.00747
##  7 Atlanta, GA (2016) Grand Rapids, MI (2016) FALSE       9 0.00336
##  8 Atlanta, GA (2016) Green Bay, WI (2016)    FALSE       8 0.00299
##  9 Atlanta, GA (2016) Houston, TX (2016)      FALSE      43 0.0161 
## 10 Atlanta, GA (2016) Indianapolis, IN (2016) FALSE      44 0.0164 
## # ... with 853 more rows
# Evaluate variable importance
helperPlotVarImp(rf_all_nt25_TDmcwhsa$rfModel)

# Evaluate error evolution
errorEvolution(rf_all_nt25_TDmcwhsa, useCategory="OOB", subT="All 30 cities individually")

## # A tibble: 25 x 3
##    ntree Category Error
##    <int> <chr>    <dbl>
##  1     1 OOB      0.604
##  2     2 OOB      0.606
##  3     3 OOB      0.603
##  4     4 OOB      0.597
##  5     5 OOB      0.589
##  6     6 OOB      0.580
##  7     7 OOB      0.568
##  8     8 OOB      0.555
##  9     9 OOB      0.545
## 10    10 OOB      0.535
## # ... with 15 more rows

Some cities are frequently misclassified as one another and might benefit from being grouped and classified jointly:

  • Las Vegas and Phoenix
  • Miami and Tampa
  • San Diego and Los Angeles
  • San Francisco and San Jose
  • New Orleans and Houston (may also match well with Miami and Tampa)
  • DC, Philadelphia, and Newark
  • Chicago and Milwaukee
  • Detroit and Grand Rapids

Encouragingly, these groupings broadly match the findings from the hierarchical clustering. The OOB error is still declining at a decent rate, so the 25-tree forest is, as expected, too small.
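Candidate groupings like these can also be surfaced programmatically from the accuracy tibble returned by evalPredictions(). A minimal sketch, using a small toy stand-in (predAccuracy is a hypothetical name) for the tibble shown above:

```r
library(dplyr)

# Toy stand-in for the accuracy tibble returned by evalPredictions()
predAccuracy <- tibble::tribble(
    ~locale,                ~predicted,             ~correct, ~n,   ~pct,
    "Las Vegas, NV (2016)", "Las Vegas, NV (2016)", TRUE,     1500, 0.56,
    "Las Vegas, NV (2016)", "Phoenix, AZ (2016)",   FALSE,    700,  0.26,
    "Miami, FL (2016)",     "Miami, FL (2016)",     TRUE,     1600, 0.60,
    "Miami, FL (2016)",     "Tampa Bay, FL (2016)", FALSE,    600,  0.22
)

# The largest off-diagonal rates identify locales that are frequently
# mistaken for one another and are candidates for grouping
confusedPairs <- predAccuracy %>%
    filter(!correct) %>%
    arrange(desc(pct)) %>%
    select(locale, predicted, pct)
confusedPairs
```

In the full output, rows such as Las Vegas predicted as Phoenix and Miami predicted as Tampa Bay float to the top, matching the bullets above.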

A mapping file is created to map relevant cities together:

# Create the locale mapper
locMapperTibble_ss01 <- tibble::tribble(
    ~source, ~locType,
    'katl_2016', 'Same',
    'kbos_2016', 'Same',
    'kdca_2016', 'DC-PHL-EWR (2016)',
    'kden_2016', 'Same',
    'kdfw_2016', 'Same',
    'kdtw_2016', 'Detroit-Grand Rapids (2016)',
    'kewr_2016', 'DC-PHL-EWR (2016)',
    'kgrb_2016', 'Same',
    'kgrr_2016', 'Detroit-Grand Rapids (2016)',
    'kiah_2016', 'Houston-New Orleans (2016)',
    'kind_2016', 'Same',
    'klas_2016', 'Las Vegas-Phoenix (2016)',
    'klax_2016', 'Los Angeles-San Diego (2016)',
    'klnk_2016', 'Same',
    'kmia_2016', 'Miami-Tampa Bay (2016)',
    'kmke_2016', 'Chicago-Milwaukee (2016)',
    'kmsn_2016', 'Same',
    'kmsp_2016', 'Same',
    'kmsy_2016', 'Houston-New Orleans (2016)',
    'kord_2016', 'Chicago-Milwaukee (2016)',
    'kphl_2016', 'DC-PHL-EWR (2016)',
    'kphx_2016', 'Las Vegas-Phoenix (2016)',
    'ksan_2016', 'Los Angeles-San Diego (2016)',
    'ksat_2016', 'Same',
    'ksea_2016', 'Same',
    'ksfo_2016', 'Bay Area (2016)',
    'ksjc_2016', 'Bay Area (2016)',
    'kstl_2016', 'Same',
    'ktpa_2016', 'Miami-Tampa Bay (2016)',
    'ktvc_2016', 'Same'
)

# Create locMapper
locMapper_ss01 <- locMapperTibble_ss01 %>% pull(locType)
names(locMapper_ss01) <- locMapperTibble_ss01 %>% pull(source)
locMapper_ss01
##                      katl_2016                      kbos_2016 
##                         "Same"                         "Same" 
##                      kdca_2016                      kden_2016 
##            "DC-PHL-EWR (2016)"                         "Same" 
##                      kdfw_2016                      kdtw_2016 
##                         "Same"  "Detroit-Grand Rapids (2016)" 
##                      kewr_2016                      kgrb_2016 
##            "DC-PHL-EWR (2016)"                         "Same" 
##                      kgrr_2016                      kiah_2016 
##  "Detroit-Grand Rapids (2016)"   "Houston-New Orleans (2016)" 
##                      kind_2016                      klas_2016 
##                         "Same"     "Las Vegas-Phoenix (2016)" 
##                      klax_2016                      klnk_2016 
## "Los Angeles-San Diego (2016)"                         "Same" 
##                      kmia_2016                      kmke_2016 
##       "Miami-Tampa Bay (2016)"     "Chicago-Milwaukee (2016)" 
##                      kmsn_2016                      kmsp_2016 
##                         "Same"                         "Same" 
##                      kmsy_2016                      kord_2016 
##   "Houston-New Orleans (2016)"     "Chicago-Milwaukee (2016)" 
##                      kphl_2016                      kphx_2016 
##            "DC-PHL-EWR (2016)"     "Las Vegas-Phoenix (2016)" 
##                      ksan_2016                      ksat_2016 
## "Los Angeles-San Diego (2016)"                         "Same" 
##                      ksea_2016                      ksfo_2016 
##                         "Same"              "Bay Area (2016)" 
##                      ksjc_2016                      kstl_2016 
##              "Bay Area (2016)"                         "Same" 
##                      ktpa_2016                      ktvc_2016 
##       "Miami-Tampa Bay (2016)"                         "Same"
# Create the data file with locType
locData_ss01 <- metarData %>% 
    mutate(locType=ifelse(locMapper_ss01[source]=="Same", locale, locMapper_ss01[source]), 
           hr=lubridate::hour(dtime)) %>% 
    filter(year==2016, locType !="Exclude")

# Counts by locType
locData_ss01 %>%
    count(locType) %>%
    arrange(-n) %>%
    as.data.frame()
##                         locType     n
## 1             DC-PHL-EWR (2016) 26252
## 2      Las Vegas-Phoenix (2016) 17543
## 3        Miami-Tampa Bay (2016) 17539
## 4  Los Angeles-San Diego (2016) 17536
## 5    Houston-New Orleans (2016) 17535
## 6   Detroit-Grand Rapids (2016) 17534
## 7      Chicago-Milwaukee (2016) 17527
## 8               Bay Area (2016) 17526
## 9            Atlanta, GA (2016)  8772
## 10            Dallas, TX (2016)  8772
## 11            Denver, CO (2016)  8772
## 12       Minneapolis, MN (2016)  8769
## 13     Traverse City, MI (2016)  8766
## 14           Lincoln, NE (2016)  8765
## 15       San Antonio, TX (2016)  8765
## 16           Seattle, WA (2016)  8765
## 17         Green Bay, WI (2016)  8755
## 18           Madison, WI (2016)  8750
## 19      Indianapolis, IN (2016)  8720
## 20       Saint Louis, MO (2016)  8702
## 21            Boston, MA (2016)  8663

The data are then sampled such that each locType is equally prevalent:

# Set a seed for reproducibility
set.seed(2007041419)

# Find the smallest locale type
nSmall <- locData_ss01 %>%
    filter(!is.na(locType)) %>%
    count(locType) %>%
    pull(n) %>%
    min()

# Create the relevant data subset
ss01_data <- locData_ss01 %>%
    filter(!is.na(locType)) %>%
    group_by(locType) %>%
    sample_n(size=nSmall, replace=FALSE) %>%
    ungroup()

# Summarize the data subset
ss01_data %>% 
    count(locale) %>% 
    arrange(-n) %>%
    as.data.frame()
##                      locale    n
## 1        Atlanta, GA (2016) 8663
## 2         Boston, MA (2016) 8663
## 3         Dallas, TX (2016) 8663
## 4         Denver, CO (2016) 8663
## 5      Green Bay, WI (2016) 8663
## 6   Indianapolis, IN (2016) 8663
## 7        Lincoln, NE (2016) 8663
## 8        Madison, WI (2016) 8663
## 9    Minneapolis, MN (2016) 8663
## 10   Saint Louis, MO (2016) 8663
## 11   San Antonio, TX (2016) 8663
## 12       Seattle, WA (2016) 8663
## 13 Traverse City, MI (2016) 8663
## 14 San Francisco, CA (2016) 4400
## 15     Las Vegas, NV (2016) 4376
## 16     Milwaukee, WI (2016) 4362
## 17         Miami, FL (2016) 4360
## 18   New Orleans, LA (2016) 4357
## 19     San Diego, CA (2016) 4345
## 20       Detroit, MI (2016) 4334
## 21  Grand Rapids, MI (2016) 4329
## 22   Los Angeles, CA (2016) 4318
## 23       Houston, TX (2016) 4306
## 24     Tampa Bay, FL (2016) 4303
## 25       Chicago, IL (2016) 4301
## 26       Phoenix, AZ (2016) 4287
## 27      San Jose, CA (2016) 4263
## 28  Philadelphia, PA (2016) 2909
## 29    Washington, DC (2016) 2896
## 30        Newark, NJ (2016) 2858

A random forest can then be run to predict locType (using a slightly bigger forest of 50 trees):

# Run random forest for subset data
rf_s01_nt50_TDmcwhsa <- rfMultiLocale(ss01_data, 
                                      vrbls=c("TempF", "DewF", 
                                              "month", "hr",
                                              "minHeight", "ceilingHeight", 
                                              "WindSpeed", "predomDir",
                                              "modSLP", "Altimeter"
                                              ),
                                      locs=NULL, 
                                      locVar="locType",
                                      pred="locType",
                                      otherVar=c("dtime", "locale", "source"),
                                      ntree=50, 
                                      seed=2007041423, 
                                      mtry=4
                                      )
## 
## Running for locations:
##  [1] "Atlanta, GA (2016)"           "Bay Area (2016)"             
##  [3] "Boston, MA (2016)"            "Chicago-Milwaukee (2016)"    
##  [5] "Dallas, TX (2016)"            "DC-PHL-EWR (2016)"           
##  [7] "Denver, CO (2016)"            "Detroit-Grand Rapids (2016)" 
##  [9] "Green Bay, WI (2016)"         "Houston-New Orleans (2016)"  
## [11] "Indianapolis, IN (2016)"      "Las Vegas-Phoenix (2016)"    
## [13] "Lincoln, NE (2016)"           "Los Angeles-San Diego (2016)"
## [15] "Madison, WI (2016)"           "Miami-Tampa Bay (2016)"      
## [17] "Minneapolis, MN (2016)"       "Saint Louis, MO (2016)"      
## [19] "San Antonio, TX (2016)"       "Seattle, WA (2016)"          
## [21] "Traverse City, MI (2016)"

Accuracy can then be assessed:

# Evaluate prediction accuracy
evalPredictions(rf_s01_nt50_TDmcwhsa, 
                plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, Altimeter, and SLP", 
                keyVar="locType", 
                locOrder=NULL
                )

## # A tibble: 432 x 5
##    locale             predicted                   correct     n     pct
##    <fct>              <fct>                       <lgl>   <int>   <dbl>
##  1 Atlanta, GA (2016) Atlanta, GA (2016)          TRUE     1840 0.718  
##  2 Atlanta, GA (2016) Bay Area (2016)             FALSE      60 0.0234 
##  3 Atlanta, GA (2016) Boston, MA (2016)           FALSE      21 0.00819
##  4 Atlanta, GA (2016) Chicago-Milwaukee (2016)    FALSE      17 0.00663
##  5 Atlanta, GA (2016) Dallas, TX (2016)           FALSE      83 0.0324 
##  6 Atlanta, GA (2016) DC-PHL-EWR (2016)           FALSE      60 0.0234 
##  7 Atlanta, GA (2016) Denver, CO (2016)           FALSE       8 0.00312
##  8 Atlanta, GA (2016) Detroit-Grand Rapids (2016) FALSE      28 0.0109 
##  9 Atlanta, GA (2016) Green Bay, WI (2016)        FALSE       8 0.00312
## 10 Atlanta, GA (2016) Houston-New Orleans (2016)  FALSE      52 0.0203 
## # ... with 422 more rows
# Evaluate variable importance
helperPlotVarImp(rf_s01_nt50_TDmcwhsa$rfModel)

# Evaluate error evolution
errorEvolution(rf_s01_nt50_TDmcwhsa, useCategory="OOB", subT="Subset 01: 21 Groups of 1-3 Locales")

## # A tibble: 50 x 3
##    ntree Category Error
##    <int> <chr>    <dbl>
##  1     1 OOB      0.581
##  2     2 OOB      0.583
##  3     3 OOB      0.579
##  4     4 OOB      0.569
##  5     5 OOB      0.561
##  6     6 OOB      0.552
##  7     7 OOB      0.541
##  8     8 OOB      0.529
##  9     9 OOB      0.517
## 10    10 OOB      0.506
## # ... with 40 more rows

A few other consolidations appear merited:

  • Miami-Tampa and Houston-New Orleans
  • Boston with DC-PHL-EWR
  • Chicago-Milwaukee and Detroit-Grand Rapids with Madison and Green Bay
  • Saint Louis and Indianapolis
  • Dallas and San Antonio

Example code includes:

# Create the locale mapper
locMapperTibble_ss02 <- tibble::tribble(
    ~source, ~locType,
    'katl_2016', 'Same',
    'kbos_2016', 'DC-PHL-EWR-BOS (2016)',
    'kdca_2016', 'DC-PHL-EWR-BOS (2016)',
    'kden_2016', 'Same',
    'kdfw_2016', 'Dallas-San Antonio (2016)',
    'kdtw_2016', 'Cold Midwest (2016)',
    'kewr_2016', 'DC-PHL-EWR-BOS (2016)',
    'kgrb_2016', 'Cold Midwest (2016)',
    'kgrr_2016', 'Cold Midwest (2016)',
    'kiah_2016', 'Gulf Coast (2016)',
    'kind_2016', 'St Louis-Indianapolis (2016)',
    'klas_2016', 'Las Vegas-Phoenix (2016)',
    'klax_2016', 'Los Angeles-San Diego (2016)',
    'klnk_2016', 'Same',
    'kmia_2016', 'Gulf Coast (2016)',
    'kmke_2016', 'Cold Midwest (2016)',
    'kmsn_2016', 'Cold Midwest (2016)',
    'kmsp_2016', 'Same',
    'kmsy_2016', 'Gulf Coast (2016)',
    'kord_2016', 'Cold Midwest (2016)',
    'kphl_2016', 'DC-PHL-EWR-BOS (2016)',
    'kphx_2016', 'Las Vegas-Phoenix (2016)',
    'ksan_2016', 'Los Angeles-San Diego (2016)',
    'ksat_2016', 'Dallas-San Antonio (2016)',
    'ksea_2016', 'Same',
    'ksfo_2016', 'Bay Area (2016)',
    'ksjc_2016', 'Bay Area (2016)',
    'kstl_2016', 'St Louis-Indianapolis (2016)',
    'ktpa_2016', 'Gulf Coast (2016)',
    'ktvc_2016', 'Same'
)

# Create locMapper
locMapper_ss02 <- locMapperTibble_ss02 %>% pull(locType)
names(locMapper_ss02) <- locMapperTibble_ss02 %>% pull(source)
locMapper_ss02
##                      katl_2016                      kbos_2016 
##                         "Same"        "DC-PHL-EWR-BOS (2016)" 
##                      kdca_2016                      kden_2016 
##        "DC-PHL-EWR-BOS (2016)"                         "Same" 
##                      kdfw_2016                      kdtw_2016 
##    "Dallas-San Antonio (2016)"          "Cold Midwest (2016)" 
##                      kewr_2016                      kgrb_2016 
##        "DC-PHL-EWR-BOS (2016)"          "Cold Midwest (2016)" 
##                      kgrr_2016                      kiah_2016 
##          "Cold Midwest (2016)"            "Gulf Coast (2016)" 
##                      kind_2016                      klas_2016 
## "St Louis-Indianapolis (2016)"     "Las Vegas-Phoenix (2016)" 
##                      klax_2016                      klnk_2016 
## "Los Angeles-San Diego (2016)"                         "Same" 
##                      kmia_2016                      kmke_2016 
##            "Gulf Coast (2016)"          "Cold Midwest (2016)" 
##                      kmsn_2016                      kmsp_2016 
##          "Cold Midwest (2016)"                         "Same" 
##                      kmsy_2016                      kord_2016 
##            "Gulf Coast (2016)"          "Cold Midwest (2016)" 
##                      kphl_2016                      kphx_2016 
##        "DC-PHL-EWR-BOS (2016)"     "Las Vegas-Phoenix (2016)" 
##                      ksan_2016                      ksat_2016 
## "Los Angeles-San Diego (2016)"    "Dallas-San Antonio (2016)" 
##                      ksea_2016                      ksfo_2016 
##                         "Same"              "Bay Area (2016)" 
##                      ksjc_2016                      kstl_2016 
##              "Bay Area (2016)" "St Louis-Indianapolis (2016)" 
##                      ktpa_2016                      ktvc_2016 
##            "Gulf Coast (2016)"                         "Same"
# Create the data file with locType
locData_ss02 <- metarData %>% 
    mutate(locType=ifelse(locMapper_ss02[source]=="Same", locale, locMapper_ss02[source]), 
           hr=lubridate::hour(dtime)) %>% 
    filter(year==2016, locType !="Exclude")

# Counts by locType
locData_ss02 %>%
    count(locType) %>%
    arrange(-n) %>%
    as.data.frame()
##                         locType     n
## 1           Cold Midwest (2016) 52566
## 2             Gulf Coast (2016) 35074
## 3         DC-PHL-EWR-BOS (2016) 34915
## 4      Las Vegas-Phoenix (2016) 17543
## 5     Dallas-San Antonio (2016) 17537
## 6  Los Angeles-San Diego (2016) 17536
## 7               Bay Area (2016) 17526
## 8  St Louis-Indianapolis (2016) 17422
## 9            Atlanta, GA (2016)  8772
## 10            Denver, CO (2016)  8772
## 11       Minneapolis, MN (2016)  8769
## 12     Traverse City, MI (2016)  8766
## 13           Lincoln, NE (2016)  8765
## 14           Seattle, WA (2016)  8765

The data are then sampled such that each locType is equally prevalent:

# Set a seed for reproducibility
set.seed(2007041447)

# Find the smallest locale type
nSmall <- locData_ss02 %>%
    filter(!is.na(locType)) %>%
    count(locType) %>%
    pull(n) %>%
    min()

# Create the relevant data subset
ss02_data <- locData_ss02 %>%
    filter(!is.na(locType)) %>%
    group_by(locType) %>%
    sample_n(size=nSmall, replace=FALSE) %>%
    ungroup()

# Summarize the data subset
ss02_data %>% 
    count(locale) %>% 
    arrange(-n) %>%
    as.data.frame()
##                      locale    n
## 1        Atlanta, GA (2016) 8765
## 2         Denver, CO (2016) 8765
## 3        Lincoln, NE (2016) 8765
## 4    Minneapolis, MN (2016) 8765
## 5        Seattle, WA (2016) 8765
## 6  Traverse City, MI (2016) 8765
## 7   Indianapolis, IN (2016) 4477
## 8      San Diego, CA (2016) 4411
## 9  San Francisco, CA (2016) 4399
## 10        Dallas, TX (2016) 4393
## 11     Las Vegas, NV (2016) 4391
## 12       Phoenix, AZ (2016) 4374
## 13   San Antonio, TX (2016) 4372
## 14      San Jose, CA (2016) 4366
## 15   Los Angeles, CA (2016) 4354
## 16   Saint Louis, MO (2016) 4288
## 17    Washington, DC (2016) 2204
## 18   New Orleans, LA (2016) 2197
## 19       Houston, TX (2016) 2196
## 20  Philadelphia, PA (2016) 2196
## 21     Tampa Bay, FL (2016) 2187
## 22        Boston, MA (2016) 2186
## 23         Miami, FL (2016) 2185
## 24        Newark, NJ (2016) 2179
## 25       Chicago, IL (2016) 1525
## 26       Madison, WI (2016) 1468
## 27     Green Bay, WI (2016) 1453
## 28  Grand Rapids, MI (2016) 1445
## 29     Milwaukee, WI (2016) 1442
## 30       Detroit, MI (2016) 1432

A random forest can then be run to predict locType (using a forest of 50 trees):

# Run random forest for subset data
rf_s02_nt50_TDmcwhsa <- rfMultiLocale(ss02_data, 
                                      vrbls=c("TempF", "DewF", 
                                              "month", "hr",
                                              "minHeight", "ceilingHeight", 
                                              "WindSpeed", "predomDir",
                                              "modSLP", "Altimeter"
                                              ),
                                      locs=NULL, 
                                      locVar="locType",
                                      pred="locType",
                                      otherVar=c("dtime", "locale", "source"),
                                      ntree=50, 
                                      seed=2007041449, 
                                      mtry=4
                                      )
## 
## Running for locations:
##  [1] "Atlanta, GA (2016)"           "Bay Area (2016)"             
##  [3] "Cold Midwest (2016)"          "Dallas-San Antonio (2016)"   
##  [5] "DC-PHL-EWR-BOS (2016)"        "Denver, CO (2016)"           
##  [7] "Gulf Coast (2016)"            "Las Vegas-Phoenix (2016)"    
##  [9] "Lincoln, NE (2016)"           "Los Angeles-San Diego (2016)"
## [11] "Minneapolis, MN (2016)"       "Seattle, WA (2016)"          
## [13] "St Louis-Indianapolis (2016)" "Traverse City, MI (2016)"

Accuracy can then be assessed:

# Evaluate prediction accuracy
evalPredictions(rf_s02_nt50_TDmcwhsa, 
                plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, Altimeter, and SLP", 
                keyVar="locType", 
                locOrder=NULL
                )

## # A tibble: 193 x 5
##    locale             predicted                    correct     n     pct
##    <fct>              <fct>                        <lgl>   <int>   <dbl>
##  1 Atlanta, GA (2016) Atlanta, GA (2016)           TRUE     2026 0.769  
##  2 Atlanta, GA (2016) Bay Area (2016)              FALSE      94 0.0357 
##  3 Atlanta, GA (2016) Cold Midwest (2016)          FALSE      13 0.00493
##  4 Atlanta, GA (2016) Dallas-San Antonio (2016)    FALSE     117 0.0444 
##  5 Atlanta, GA (2016) DC-PHL-EWR-BOS (2016)        FALSE      48 0.0182 
##  6 Atlanta, GA (2016) Denver, CO (2016)            FALSE      12 0.00455
##  7 Atlanta, GA (2016) Gulf Coast (2016)            FALSE      86 0.0326 
##  8 Atlanta, GA (2016) Las Vegas-Phoenix (2016)     FALSE      33 0.0125 
##  9 Atlanta, GA (2016) Lincoln, NE (2016)           FALSE      16 0.00607
## 10 Atlanta, GA (2016) Los Angeles-San Diego (2016) FALSE      54 0.0205 
## # ... with 183 more rows
# Evaluate variable importance
helperPlotVarImp(rf_s02_nt50_TDmcwhsa$rfModel)

# Evaluate error evolution
errorEvolution(rf_s02_nt50_TDmcwhsa, useCategory="OOB", subT="Subset 02: 14 Groups of 1-6 Locales")

## # A tibble: 50 x 3
##    ntree Category Error
##    <int> <chr>    <dbl>
##  1     1 OOB      0.519
##  2     2 OOB      0.519
##  3     3 OOB      0.513
##  4     4 OOB      0.505
##  5     5 OOB      0.492
##  6     6 OOB      0.479
##  7     7 OOB      0.468
##  8     8 OOB      0.457
##  9     9 OOB      0.447
## 10    10 OOB      0.435
## # ... with 40 more rows

Overall accuracy continues to improve. Further exploration may allow the archetypes to continue evolving, particularly for the colder cities, where consolidation into archetypes has not improved accuracy.

Data are limited to only locales that do not meaningfully overlap with ‘Cold Midwest’. Functions are written to create the mapper, associated dataset, and associated data sample:

# Create the locale mapper
locMapperTibble_ss03 <- tibble::tribble(
    ~source, ~locType,
    'katl_2016', 'Same',
    'kbos_2016', 'Exclude',
    'kdca_2016', 'Exclude',
    'kden_2016', 'Same',
    'kdfw_2016', 'Dallas-San Antonio (2016)',
    'kdtw_2016', 'Exclude',
    'kewr_2016', 'Exclude',
    'kgrb_2016', 'Exclude',
    'kgrr_2016', 'Exclude',
    'kiah_2016', 'Gulf Coast (2016)',
    'kind_2016', 'Exclude',
    'klas_2016', 'Las Vegas-Phoenix (2016)',
    'klax_2016', 'Los Angeles-San Diego (2016)',
    'klnk_2016', 'Exclude',
    'kmia_2016', 'Gulf Coast (2016)',
    'kmke_2016', 'Exclude',
    'kmsn_2016', 'Exclude',
    'kmsp_2016', 'Exclude',
    'kmsy_2016', 'Gulf Coast (2016)',
    'kord_2016', 'Exclude',
    'kphl_2016', 'Exclude',
    'kphx_2016', 'Las Vegas-Phoenix (2016)',
    'ksan_2016', 'Los Angeles-San Diego (2016)',
    'ksat_2016', 'Dallas-San Antonio (2016)',
    'ksea_2016', 'Same',
    'ksfo_2016', 'Bay Area (2016)',
    'ksjc_2016', 'Bay Area (2016)',
    'kstl_2016', 'Exclude',
    'ktpa_2016', 'Gulf Coast (2016)',
    'ktvc_2016', 'Exclude'
)

# Create locMapper
locMapper_ss03 <- createLocMapper(locMapperTibble_ss03)

# Create the data file with locType
locData_ss03 <- applyLocMapper(metarData, mapper=locMapper_ss03, yearsUse=c(2016))
##                        locType     n
## 1            Gulf Coast (2016) 35074
## 2     Las Vegas-Phoenix (2016) 17543
## 3    Dallas-San Antonio (2016) 17537
## 4 Los Angeles-San Diego (2016) 17536
## 5              Bay Area (2016) 17526
## 6           Atlanta, GA (2016)  8772
## 7            Denver, CO (2016)  8772
## 8           Seattle, WA (2016)  8765
# Create the smaller data file that is balanced by locType
ss03_data <- createBalancedSample(locData_ss03, seed=2007051311)
##                      locale    n
## 1        Atlanta, GA (2016) 8765
## 2         Denver, CO (2016) 8765
## 3        Seattle, WA (2016) 8765
## 4      Las Vegas, NV (2016) 4406
## 5      San Diego, CA (2016) 4399
## 6  San Francisco, CA (2016) 4398
## 7    San Antonio, TX (2016) 4384
## 8         Dallas, TX (2016) 4381
## 9       San Jose, CA (2016) 4367
## 10   Los Angeles, CA (2016) 4366
## 11       Phoenix, AZ (2016) 4359
## 12   New Orleans, LA (2016) 2219
## 13     Tampa Bay, FL (2016) 2188
## 14       Houston, TX (2016) 2179
## 15         Miami, FL (2016) 2179

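The helpers createLocMapper(), applyLocMapper(), and createBalancedSample() live in ‘WeatherModelingFunctions_v001.R’ and package up the inline steps used for ss01 and ss02 above. Plausible sketches (assumptions based on that inline code, not the actual implementations):

```r
library(dplyr)

# Sketch of createLocMapper(): turn the source/locType tibble into a named vector
createLocMapper <- function(mapperTibble) {
    mapper <- mapperTibble %>% pull(locType)
    names(mapper) <- mapperTibble %>% pull(source)
    mapper
}

# Sketch of applyLocMapper(): attach locType and hr, keep the requested years,
# drop excluded sources, and print counts by locType
applyLocMapper <- function(dat, mapper, yearsUse) {
    out <- dat %>%
        mutate(locType = ifelse(mapper[source] == "Same", locale, mapper[source]),
               hr = lubridate::hour(dtime)) %>%
        filter(year %in% yearsUse, locType != "Exclude")
    print(out %>% count(locType) %>% arrange(-n) %>% as.data.frame())
    out
}

# Sketch of createBalancedSample(): downsample each locType to the smallest group
createBalancedSample <- function(dat, seed) {
    set.seed(seed)
    nSmall <- dat %>%
        filter(!is.na(locType)) %>%
        count(locType) %>%
        pull(n) %>%
        min()
    out <- dat %>%
        filter(!is.na(locType)) %>%
        group_by(locType) %>%
        sample_n(size = nSmall, replace = FALSE) %>%
        ungroup()
    print(out %>% count(locale) %>% arrange(-n) %>% as.data.frame())
    out
}
```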
A random forest can then be run on both the individual locales (locData_ss03) and the balanced sample of locale types (ss03_data):

# Run random forest for subset data
rf_s03_nt100_TDmcwhsa <- rfMultiLocale(ss03_data, 
                                      vrbls=c("TempF", "DewF", 
                                              "month", "hr",
                                              "minHeight", "ceilingHeight", 
                                              "WindSpeed", "predomDir",
                                              "modSLP", "Altimeter"
                                              ),
                                      locs=NULL, 
                                      locVar="locType",
                                      pred="locType",
                                      otherVar=c("dtime", "locale", "source"),
                                      ntree=100, 
                                      seed=2007051315, 
                                      mtry=4
                                      )
## 
## Running for locations:
## [1] "Atlanta, GA (2016)"           "Bay Area (2016)"             
## [3] "Dallas-San Antonio (2016)"    "Denver, CO (2016)"           
## [5] "Gulf Coast (2016)"            "Las Vegas-Phoenix (2016)"    
## [7] "Los Angeles-San Diego (2016)" "Seattle, WA (2016)"

Accuracy can then be assessed:

# Evaluate prediction accuracy
evalPredictions(rf_s03_nt100_TDmcwhsa, 
                plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, Altimeter, and SLP", 
                keyVar="locType", 
                locOrder=NULL
                )
## Warning: Removed 1 rows containing missing values (geom_text).

## # A tibble: 63 x 5
##    locale             predicted                    correct     n     pct
##    <fct>              <fct>                        <lgl>   <int>   <dbl>
##  1 Atlanta, GA (2016) Atlanta, GA (2016)           TRUE     2156 0.834  
##  2 Atlanta, GA (2016) Bay Area (2016)              FALSE      72 0.0279 
##  3 Atlanta, GA (2016) Dallas-San Antonio (2016)    FALSE     144 0.0557 
##  4 Atlanta, GA (2016) Denver, CO (2016)            FALSE      22 0.00851
##  5 Atlanta, GA (2016) Gulf Coast (2016)            FALSE      79 0.0306 
##  6 Atlanta, GA (2016) Las Vegas-Phoenix (2016)     FALSE      36 0.0139 
##  7 Atlanta, GA (2016) Los Angeles-San Diego (2016) FALSE      39 0.0151 
##  8 Atlanta, GA (2016) Seattle, WA (2016)           FALSE      37 0.0143 
##  9 Bay Area (2016)    Atlanta, GA (2016)           FALSE      33 0.0124 
## 10 Bay Area (2016)    Bay Area (2016)              TRUE     2228 0.837  
## # ... with 53 more rows
# Evaluate variable importance
helperPlotVarImp(rf_s03_nt100_TDmcwhsa$rfModel)

# Evaluate error evolution
errorEvolution(rf_s03_nt100_TDmcwhsa, 
               useCategory="OOB", 
               subT="Exclude Locales with High Overlap to Cold Midwest"
               )

## # A tibble: 100 x 3
##    ntree Category Error
##    <int> <chr>    <dbl>
##  1     1 OOB      0.327
##  2     2 OOB      0.331
##  3     3 OOB      0.322
##  4     4 OOB      0.312
##  5     5 OOB      0.301
##  6     6 OOB      0.290
##  7     7 OOB      0.281
##  8     8 OOB      0.268
##  9     9 OOB      0.256
## 10    10 OOB      0.250
## # ... with 90 more rows

Predictions continue to be highly accurate for Denver, Seattle, and Las Vegas-Phoenix. There is some mixing among the Pacific coastal groups: the Bay Area is sometimes predicted as Seattle or Los Angeles-San Diego, and Los Angeles-San Diego is sometimes predicted as the Bay Area. There is also some meaningful overlap among Dallas-San Antonio, Gulf Coast, and Atlanta.

The process can then be repeated for the individual cities:

# Run random forest for subset data
rf_l03_nt100_TDmcwhsa <- rfMultiLocale(locData_ss03, 
                                      vrbls=c("TempF", "DewF", 
                                              "month", "hr",
                                              "minHeight", "ceilingHeight", 
                                              "WindSpeed", "predomDir",
                                              "modSLP", "Altimeter"
                                              ),
                                      locs=NULL, 
                                      locVar="locale",
                                      pred="locale",
                                      otherVar=c("dtime", "locale", "source"),
                                      ntree=100, 
                                      seed=2007051325, 
                                      mtry=4
                                      )
## 
## Running for locations:
##  [1] "Atlanta, GA (2016)"       "Dallas, TX (2016)"       
##  [3] "Denver, CO (2016)"        "Houston, TX (2016)"      
##  [5] "Las Vegas, NV (2016)"     "Los Angeles, CA (2016)"  
##  [7] "Miami, FL (2016)"         "New Orleans, LA (2016)"  
##  [9] "Phoenix, AZ (2016)"       "San Antonio, TX (2016)"  
## [11] "San Diego, CA (2016)"     "San Francisco, CA (2016)"
## [13] "San Jose, CA (2016)"      "Seattle, WA (2016)"      
## [15] "Tampa Bay, FL (2016)"

Accuracy can then be assessed:

# Evaluate prediction accuracy
evalPredictions(rf_l03_nt100_TDmcwhsa, 
                plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, Altimeter, and SLP", 
                keyVar="locale", 
                locOrder=NULL
                )
## Warning: Removed 1 rows containing missing values (geom_text).

## # A tibble: 216 x 5
##    locale             predicted              correct     n     pct
##    <fct>              <fct>                  <lgl>   <int>   <dbl>
##  1 Atlanta, GA (2016) Atlanta, GA (2016)     TRUE     2064 0.775  
##  2 Atlanta, GA (2016) Dallas, TX (2016)      FALSE     125 0.0469 
##  3 Atlanta, GA (2016) Denver, CO (2016)      FALSE      23 0.00864
##  4 Atlanta, GA (2016) Houston, TX (2016)     FALSE      41 0.0154 
##  5 Atlanta, GA (2016) Las Vegas, NV (2016)   FALSE      27 0.0101 
##  6 Atlanta, GA (2016) Los Angeles, CA (2016) FALSE      49 0.0184 
##  7 Atlanta, GA (2016) Miami, FL (2016)       FALSE      18 0.00676
##  8 Atlanta, GA (2016) New Orleans, LA (2016) FALSE      34 0.0128 
##  9 Atlanta, GA (2016) Phoenix, AZ (2016)     FALSE      24 0.00901
## 10 Atlanta, GA (2016) San Antonio, TX (2016) FALSE      54 0.0203 
## # ... with 206 more rows
# Evaluate variable importance
helperPlotVarImp(rf_l03_nt100_TDmcwhsa$rfModel)

# Evaluate error evolution
errorEvolution(rf_l03_nt100_TDmcwhsa, 
               useCategory="OOB", 
               subT="Exclude Locales with High Overlap to Cold Midwest"
               )

## # A tibble: 100 x 3
##    ntree Category Error
##    <int> <chr>    <dbl>
##  1     1 OOB      0.442
##  2     2 OOB      0.440
##  3     3 OOB      0.431
##  4     4 OOB      0.428
##  5     5 OOB      0.418
##  6     6 OOB      0.406
##  7     7 OOB      0.394
##  8     8 OOB      0.380
##  9     9 OOB      0.368
## 10    10 OOB      0.358
## # ... with 90 more rows

At a glance, the combinations chosen from the individual cities seem sensible given how the individual cities tend to (mis)classify.

Random forests can also be used for regression, on continuous variables such as temperature or dew point. Suppose that a few data elements are available, and the goal is to predict the temperature from those:

# Build a regression dataset with factored variables for locale and source
regrData <- metarData %>%
    mutate(hr=lubridate::hour(lubridate::round_date(dtime, unit="1 hour")),
           hrfct=factor(hr), 
           localefct=factor(locale), 
           locName=str_replace(locale, pattern=" \\(\\d{4}\\)", replacement=""), 
           locNamefct=factor(locName)
           )

# Define a dependent variable and predictor variables and filtering criteria
depVar <- "TempF"
predVars <- c("DewF", "month", "hrfct", "locNamefct")
critFilter <- list("year"=c(2016), 
                   "locNamefct"=c("Chicago, IL", "Las Vegas, NV", "San Diego, CA", "New Orleans, LA")
                   )

# Filter such that only non-NA data across all of depVar and predVars are included
subRegData <- regrData %>%
    filter_at(vars(all_of(c(depVar, predVars, names(critFilter)))), all_vars(!is.na(.)))

# Filter such that only matches to critFilter are included
for (xNum in seq_along(critFilter)) {
    subRegData <- subRegData %>%
        filter_at(vars(all_of(names(critFilter)[xNum])), ~. %in% critFilter[[xNum]])
}

# Split into a test set and a train set
testSize <- 0.3
set.seed(2007151232)
idxTrain <- sample(1:nrow(subRegData), size=round((1-testSize)*nrow(subRegData)), replace=FALSE)
trainRegData <- subRegData[idxTrain, ]  # will be in random order
testRegData <- subRegData[-idxTrain, ]  # will be in sorted order

# Run random forest regression
rfRegrData <- randomForest::randomForest(x=trainRegData[, predVars, drop=FALSE], 
                                         y=trainRegData[, depVar, drop=TRUE], 
                                         ntree=25, 
                                         mtry=2
                                         )

# Assess variable importance
rfRegrData %>% 
    caret::varImp()
##              Overall
## DewF       1940150.5
## month      2886275.7
## hrfct       543114.8
## locNamefct 1848806.0
# Assess evolution of OOB r-squared and RMSE
tibble::tibble(ntree=1:25, rmse=rfRegrData$mse**0.5) %>%
    ggplot(aes(x=ntree, y=rmse)) + 
    geom_point() + 
    ylim(c(0, NA)) + 
    labs(x="# Trees", y="RMSE", title="Evolution of RMSE")

tibble::tibble(ntree=1:25, rsq=rfRegrData$rsq) %>%
    ggplot(aes(x=ntree, y=rsq)) + 
    geom_point() + 
    ylim(c(NA, 1)) + 
    labs(x="# Trees", y="R-squared", title="Evolution of R-squared")

# Apply to test-set data and report metrics
rfPreds <- testRegData %>%
    mutate(pred=predict(rfRegrData, newdata=testRegData[, predVars, drop=FALSE]), 
           globMean=mean(get(depVar))
           ) %>%
    group_by(locale) %>%
    mutate(locMean=mean(get(depVar))) %>%
    ungroup() %>%
    mutate(errActualGlob=(get(depVar)-globMean)**2,
           errActualLoc=(get(depVar)-locMean)**2, 
           errPredict=(get(depVar)-pred)**2
           )

# Calculate the errors by locale
rfError <- rfPreds %>%
    select(locale, starts_with("err")) %>%
    group_by(locale) %>%
    summarize_all(~mean(.)**0.5)

# Plot errors by locale
rfError %>%
    pivot_longer(-locale, names_to="errorType", values_to="rmse") %>%
    ggplot(aes(x=locale, y=rmse, fill=errorType)) + 
        geom_col(position="dodge") + 
    labs(x="", y="Average Error", title="Prediction Accuracy by Locale") + 
    scale_fill_discrete(name="Actual vs.", 
                        breaks=c("errActualGlob", "errActualLoc", "errPredict"), 
                        labels=c("Global Mean", "Locale Mean", "Predicted")
                        )

# Print errors by locale, including percentage changes
rfError %>%
    mutate(pctLocGlob=errActualLoc/errActualGlob, 
           pctPredLoc=errPredict/errActualLoc, 
           pctPredGlob=errPredict/errActualGlob
           )
## # A tibble: 4 x 7
##   locale errActualGlob errActualLoc errPredict pctLocGlob pctPredLoc pctPredGlob
##   <chr>          <dbl>        <dbl>      <dbl>      <dbl>      <dbl>       <dbl>
## 1 Chica~         24.5         20.7        5.47      0.844      0.264       0.223
## 2 Las V~         19.5         18.5        5.74      0.951      0.310       0.295
## 3 New O~         14.9         13.1        3.69      0.879      0.282       0.248
## 4 San D~          6.82         6.82       3.41      1.00       0.499       0.499

A simple random forest significantly reduces the null error in predicting temperatures, particularly for Chicago, Las Vegas, and New Orleans. This suggests that knowing the dew point and the month enhances predictions for seasonal locales, while being of less use in a marine climate with modest seasonal variation.

Note that San Diego has more accurate predictions than any other locale; it is just that San Diego does not need much more than locale to be accurate. In contrast, Chicago averages ~25 degrees of error using the global mean, ~20 degrees using the local mean, and ~5 degrees using dew point, month, and hour.

Notably, even a single tree explains nearly 90% of the variance and drives average prediction errors down to under 6 degrees. While performance evolves with additional trees, the gains appear modest; large forests may not be necessary for this regression analysis phase.

The above process is converted to functional form, with the main function running the filtering and then calling a modified rfTwoLocale() that is adapted to handle either classification or regression. The resulting function, rfRegression(), is available in WeatherModelingFunctions_v001.R.
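The actual implementation is not reproduced here; the following is a minimal base-R sketch of the kind of wrapper rfRegression() provides. The name rfRegressionSketch and all defaults are illustrative assumptions, not the code in WeatherModelingFunctions_v001.R:

```r
# Illustrative sketch of a filter-split-fit wrapper in the spirit of
# rfRegression(); names and defaults are assumptions, not the real function
rfRegressionSketch <- function(data, depVar, predVars, critFilter=list(),
                               seed=1, ntree=25, mtry=2, testSize=0.3) {
    # Keep only rows with non-NA values across all used variables
    useVars <- unique(c(depVar, predVars, names(critFilter)))
    data <- data[stats::complete.cases(data[, useVars, drop=FALSE]), ]

    # Apply each filtering criterion in turn
    for (nm in names(critFilter)) {
        data <- data[data[[nm]] %in% critFilter[[nm]], ]
    }

    # Split into a train set and a test set
    set.seed(seed)
    idxTrain <- sample(seq_len(nrow(data)),
                       size=round((1 - testSize) * nrow(data)))

    # Fit the forest and return it along with the held-out test data
    rfModel <- randomForest::randomForest(x=data[idxTrain, predVars, drop=FALSE],
                                          y=data[[depVar]][idxTrain],
                                          ntree=ntree, mtry=mtry)
    list(rfModel=rfModel, testData=data[-idxTrain, ])
}
```

The real rfRegression() also accepts an otherVar argument (used later in this document); the sketch omits it for brevity.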

The regression is run again using the same arguments with the new function, and results are cached:

rfRegrData2 <- rfRegression(regrData, 
                            depVar=depVar, 
                            predVars=predVars, 
                            critFilter=critFilter, 
                            seed=2007151232, 
                            ntree=25, 
                            mtry=2, 
                            testSize=0.3
                            )

The general outcomes are nearly identical:

# Assess variable importance
rfRegrData2$rfModel %>% 
    caret::varImp()
##              Overall
## DewF       1878740.7
## month      2945222.5
## hrfct       545891.1
## locNamefct 1886122.8
# Assess evolution of OOB r-squared and RMSE
tibble::tibble(ntree=1:25, rmse=rfRegrData2$mse**0.5) %>%
    ggplot(aes(x=ntree, y=rmse)) + 
    geom_point() + 
    ylim(c(0, NA)) + 
    labs(x="# Trees", y="RMSE", title="Evolution of RMSE")

tibble::tibble(ntree=1:25, rsq=rfRegrData2$rsq) %>%
    ggplot(aes(x=ntree, y=rsq)) + 
    geom_point() + 
    ylim(c(NA, 1)) + 
    labs(x="# Trees", y="R-squared", title="Evolution of R-squared")

An evaluation function suitable for regression is written to cover the following metrics:

  1. Variable importance
  2. Evolution of RMSE and r-squared with number of trees
  3. RMSE and r-squared in the test dataset plotted by key variable (locale)
  4. Actual vs. predicted, optionally filtered and facetted, with an lm smooth and y=x abline

Each sub-section is written as its own function, with a wrapper function that calls them all together. The below functions are available in WeatherModelingFunctions_v001.R:

  • plotVariableImportance - plot of caret::varImp()
  • plotMSERSQ - plot of RMSE and R-squared evolution with number of trees
  • plotErrorByVar - plot of RMSE and R-squared by a user-defined key variable
  • plotActualVsPredicted - plot of actual vs. predicted, optionally facetted and filtered
  • evalRFRegression - calls the four functions above

Example code for running the functions alone and together includes:

# Run function #1
plotVariableImportance(rfRegrData, 
                       returnData=TRUE, 
                       caption="temperature ~ f(locale, month, hour, dew point)", 
                       titleAdd=": Predicting Temperature"
                       )

## # A tibble: 4 x 2
##   predictor   Overall
##   <chr>         <dbl>
## 1 DewF       1940151.
## 2 month      2886276.
## 3 hrfct       543115.
## 4 locNamefct 1848806.
# Run function #2
plotMSERSQ(rfRegrData, returnData=TRUE)

## # A tibble: 25 x 3
##    ntree  rmse   rsq
##    <int> <dbl> <dbl>
##  1     1  5.78 0.891
##  2     2  5.64 0.897
##  3     3  5.51 0.901
##  4     4  5.39 0.906
##  5     5  5.29 0.909
##  6     6  5.23 0.911
##  7     7  5.19 0.913
##  8     8  5.14 0.914
##  9     9  5.07 0.916
## 10    10  5.04 0.917
## # ... with 15 more rows
# Run function #3
plotErrorByVar(rfRegrData2, depVar="TempF", keyVar="locNamefct", returnData=TRUE)

## # A tibble: 4 x 5
##   locNamefct      varTot varMod  rmse   rsq
##   <fct>            <dbl>  <dbl> <dbl> <dbl>
## 1 Chicago, IL      427.    29.9  5.46 0.930
## 2 Las Vegas, NV    343.    32.9  5.73 0.904
## 3 New Orleans, LA  171.    13.7  3.70 0.920
## 4 San Diego, CA     46.6   11.8  3.43 0.747
# Run function #4
plotActualVsPredicted(rfRegrData2, depVar="TempF", facetVar="locNamefct", returnData=TRUE)

## # A tibble: 10,495 x 9
##    actual source dtime                DewF month hrfct locNamefct predicted
##     <dbl> <chr>  <dttm>              <dbl> <fct> <fct> <fct>          <dbl>
##  1   41   klas_~ 2016-01-01 03:56:00  8.06 Jan   4     Las Vegas~      42.9
##  2   39.0 klas_~ 2016-01-01 04:56:00  8.06 Jan   5     Las Vegas~      37.5
##  3   32   klas_~ 2016-01-01 10:56:00  8.96 Jan   11    Las Vegas~      34.8
##  4   30.0 klas_~ 2016-01-01 11:56:00  8.96 Jan   12    Las Vegas~      35.1
##  5   42.1 klas_~ 2016-01-01 18:56:00  8.96 Jan   19    Las Vegas~      45.4
##  6   46.0 klas_~ 2016-01-01 20:56:00  8.96 Jan   21    Las Vegas~      49.6
##  7   48.0 klas_~ 2016-01-01 22:56:00 10.9  Jan   23    Las Vegas~      53.8
##  8   45.0 klas_~ 2016-01-02 01:56:00 10.9  Jan   2     Las Vegas~      49.4
##  9   41   klas_~ 2016-01-02 02:56:00 12.9  Jan   3     Las Vegas~      51.0
## 10   39.9 klas_~ 2016-01-02 08:56:00 15.1  Jan   9     Las Vegas~      43.5
## # ... with 10,485 more rows, and 1 more variable: correct <lgl>
# Run combined function
rfOut_002 <- evalRFRegression(rfRegrData2, 
                              depVar="TempF", 
                              keyVar="locNamefct", 
                              facetVar="locNamefct", 
                              caption="Temp ~ f(locale, month, dew point, hour)",
                              returnData=TRUE
                              )

The random forest is then run for an expanded list of locales in 2016, with results cached:

keyLocs3 <- c("Chicago, IL", "Las Vegas, NV", "San Diego, CA", "New Orleans, LA", "Atlanta, GA", "Denver, CO", "Dallas, TX", "Miami, FL", "Phoenix, AZ", "Seattle, WA", "Boston, MA", "San Francisco, CA")

# Run for select 2016 data
rfRegrData3 <- rfRegression(regrData, 
                            depVar="TempF", 
                            predVars=c("DewF", "month", "hrfct", "locNamefct"), 
                            critFilter=list(year=2016, locNamefct=keyLocs3), 
                            seed=2007171405, 
                            ntree=20, 
                            mtry=2, 
                            testSize=0.3
                            )

The results can then be evaluated:

rfOut_003 <- evalRFRegression(rfRegrData3, 
                              depVar="TempF", 
                              keyVar="locNamefct", 
                              facetVar="locNamefct", 
                              caption="Temp ~ f(locale, month, dew point, hour)",
                              returnData=TRUE
                              )

Month, locale, and dew point are all relatively high in variable importance, while hour is low. This is somewhat surprising given the usual diurnal temperature cycle, though the results suggest that month and dew point have greater impact on temperatures. Denver is the hardest locale for the model to predict.

The process is run again with hour excluded to see how that impacts the results:

keyLocs3a <- c("Chicago, IL", "Las Vegas, NV", "San Diego, CA", "New Orleans, LA", "Atlanta, GA", "Denver, CO", "Dallas, TX", "Miami, FL", "Phoenix, AZ", "Seattle, WA", "Boston, MA", "San Francisco, CA")

# Run for select 2016 data
rfRegrData3a <- rfRegression(regrData, 
                             depVar="TempF", 
                             predVars=c("DewF", "month", "locNamefct"), 
                             otherVar=c("source", "dtime", "hrfct"),
                             critFilter=list(year=2016, locNamefct=keyLocs3a), 
                             seed=2007171405, 
                             ntree=20, 
                             mtry=2, 
                             testSize=0.3
                             )

And, the results are again evaluated:

rfOut_003a <- evalRFRegression(rfRegrData3a, 
                               depVar="TempF", 
                               keyVar="locNamefct", 
                               facetVar="locNamefct", 
                               caption="Temp ~ f(locale, month, dew point)",
                               returnData=TRUE
                               )

Run time is significantly faster with the elimination of the 24-category factor, hour. Interestingly, there is no longer much change in RMSE or R-squared with more trees, suggesting that the main work done by the previous model was incorporating the impact of hour. Removing hour meaningfully degrades both RMSE (~1.5 degrees worse) and R-squared (~5% less variance explained). Several locales no longer have prediction smooths that follow the 45-degree line, suggesting that hour may be especially helpful in specific locales. Next steps are to explore this further.

Single-variable regressions are run to assess performance, using 1) only dew point, 2) only month, 3) only locale, 4) only hour, 5) only sea-level pressure, and 6) only altimeter:

# Set up the explanatory variables and other variables to keep from 'testData'
varSingles <- c("DewF", "month", "locNamefct", "hrfct", "modSLP", "Altimeter")
otherVars <- c("source", "dtime", "DewF", "month", "locNamefct", "hrfct", "modSLP", "Altimeter")

# Set up listContainerRFReg to hold results
listContainerRFReg <- vector("list", 6)
names(listContainerRFReg) <- varSingles

n <- 1
for (thisVar in varSingles) {
    
    # Run a single-variable regression for TempF vs. thisVar (should not evolve with number of trees)
    listContainerRFReg[[n]] <- rfRegression(regrData, 
                                            depVar="TempF", 
                                            predVars=thisVar, 
                                            otherVar=otherVars,
                                            critFilter=list(year=2016, locNamefct=keyLocs3a), 
                                            seed=2007181302+n, 
                                            ntree=5, 
                                            mtry=1, 
                                            testSize=0.3
                                            )
    
    # Increment n
    n <- n + 1
    
}

# R-squared of overall model by model type
rsq <- sapply(listContainerRFReg, FUN=function(x) x$rsq) %>%
    tibble::as_tibble() %>%
    mutate(ntree=row_number())

# Plot of final R-squared
rsq %>%
    filter(ntree==max(ntree)) %>%
    pivot_longer(-ntree, names_to="var", values_to="rsq") %>%
    ggplot(aes(x=fct_reorder(paste0(var, "\n", varMapper[var]), rsq), y=rsq)) +
    geom_col(fill="lightblue") + 
    geom_text(aes(y=rsq/2, label=round(rsq, 3))) + 
    coord_flip() +
    labs(x="", 
         y="Overall R-squared", 
         title="Predicting temperature using only a single variable"
         )

At an overall level, month and dew point are the most helpful in differentiating temperatures, followed by locale. Hour is the least helpful, and appears to add value only when other variables have already been included.

How is accuracy in the varying locales impacted by the variable used?

# Impact for just using month
plotErrorByVar(listContainerRFReg$month, 
               depVar="TempF", 
               keyVar="locNamefct", 
               caption="Temperature ~ f(month)"
               )
## Warning: Removed 4 rows containing missing values (position_stack).

# Impact for just using dew point
plotErrorByVar(listContainerRFReg$DewF, 
               depVar="TempF", 
               keyVar="locNamefct", 
               caption="Temperature ~ f(dewpoint)"
               )
## Warning: Removed 4 rows containing missing values (position_stack).

# Impact for just using locale
plotErrorByVar(listContainerRFReg$locNamefct, 
               depVar="TempF", 
               keyVar="month", 
               caption="Temperature ~ f(locale)"
               )
## Warning: Removed 7 rows containing missing values (position_stack).

The issues with using a single predictor are obvious, as many of the R-squared values by dimension become negative. Specifically:

  • Using month-only as a predictor works best for Atlanta, Dallas, and Las Vegas, while driving worse than null performance for Miami, San Francisco, and San Diego
  • Using dewpoint-only as a predictor works best for New Orleans, Atlanta, and Dallas, while driving worse than null performance for the desert and marine locales
  • Using locale-only as a predictor works best in March-May and October-November, while driving worse than null performance for summer and winter months
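A negative R-squared is possible out of sample because it is computed as 1 - SSE/SST rather than as a squared correlation; when a model's squared errors exceed those of the null (mean) prediction, the ratio exceeds 1. A toy base-R illustration (numbers invented for demonstration, not from the METAR data):

```r
# Toy demonstration that out-of-sample R-squared goes negative when
# predictions are worse than simply using the mean
actual <- c(30, 40, 50, 60, 70)
pred   <- c(55, 52, 50, 48, 45)   # a poor model, anti-correlated with actual

sse <- sum((actual - pred)^2)          # error of the model
sst <- sum((actual - mean(actual))^2)  # error of the null (mean) model
rsq <- 1 - sse/sst
rsq   # about -0.54: the model does worse than the mean
```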

The month-only result is explored in more detail:

# Predictions based solely on month
listContainerRFReg$month$testData %>%
    group_by(locNamefct, month) %>%
    summarize(meanTemp=mean(TempF)) %>%
    group_by(month) %>%
    mutate(meanMonth=mean(meanTemp)) %>%
    ungroup() %>%
    filter(locNamefct %in% c("Atlanta, GA", "Dallas, TX", "Las Vegas, NV", 
                             "Miami, FL", "San Francisco, CA", "San Diego, CA"
                             )
           ) %>%
    ggplot(aes(x=month, y=meanTemp)) + 
    geom_line(aes(group=locNamefct, color=locNamefct)) + 
    geom_line(data=~filter(., locNamefct=="Atlanta, GA"), aes(y=meanMonth, group=1, lwd=2)) + 
    labs(x="", 
         y="Average Temperature", 
         title="Average Temperature by Month", 
         subtitle="Overall and by locale"
         ) + 
    scale_color_discrete("Locale") + 
    scale_size_continuous("Overall Average", labels=NULL)

The findings are as expected. Atlanta, Dallas, and Las Vegas have seasonal variations that broadly follow the overall average. In contrast, San Francisco and San Diego tend to be warmer than the overall average in winter and colder in summer. Miami tends to be warmer than the overall average in all months, with little monthly variation.

Clearly, at least two variables are needed. Suppose that for simplification, only the two best factor variable predictors are used:

# Run a regression for TempF vs. month and locale
rfTempvMonthLocale <- rfRegression(regrData, 
                                   depVar="TempF", 
                                   predVars=c("month", "locNamefct"), 
                                   otherVar=otherVars,
                                   critFilter=list(year=2016, locNamefct=keyLocs3a), 
                                   seed=2007181357, 
                                   ntree=5, 
                                   mtry=2, 
                                   testSize=0.3
                                   )

# Evaluate simple regression using only locale and month
rfOut_tvml <- evalRFRegression(rfTempvMonthLocale, 
                               depVar="TempF", 
                               keyVar="locNamefct", 
                               facetVar="locNamefct", 
                               caption="Temp ~ f(locale, month)",
                               returnData=TRUE
                               )

# Impact on predictions by month
plotErrorByVar(rfTempvMonthLocale, 
               depVar="TempF", 
               keyVar="month", 
               caption="Temp ~ f(locale, month)"
               )

# Impact on predictions by hour
plotErrorByVar(rfTempvMonthLocale, 
               depVar="TempF", 
               keyVar="hrfct", 
               caption="Temp ~ f(locale, month)"
               )

As expected, R-squared is now positive for all elements on the locale and month dimensions. Predictions appear to be less accurate for the 19-24Z range, reflecting roughly the afternoon in the US, and to be most accurate in the 2-8 (early nighttime) and 14-17 (morning) hours. There is no evolution in RMSE or R-squared with number of trees since the model is (essentially) just finding the average temperature by locale and month.

Hour is added to the model for an additional attempt to improve accuracy:

# Run a regression for TempF vs. month and locale and hour
rfTempvMonthLocaleHour <- rfRegression(regrData, 
                                       depVar="TempF", 
                                       predVars=c("month", "locNamefct", "hrfct"), 
                                       otherVar=otherVars,
                                       critFilter=list(year=2016, locNamefct=keyLocs3a), 
                                       seed=2007181421, 
                                       ntree=5, 
                                       mtry=3, 
                                       testSize=0.3
                                       )

Evaluations are then performed:

# Evaluate simple regression using only locale and month and hour
rfOut_tvmlh <- evalRFRegression(rfTempvMonthLocaleHour, 
                                depVar="TempF", 
                                keyVar="locNamefct", 
                                facetVar="locNamefct", 
                                caption="Temp ~ f(locale, month, hour)",
                                returnData=TRUE
                                )

# Impact on predictions by month
plotErrorByVar(rfTempvMonthLocaleHour, 
               depVar="TempF", 
               keyVar="month", 
               caption="Temp ~ f(locale, month, hour)"
               )

# Impact on predictions by hour
plotErrorByVar(rfTempvMonthLocaleHour, 
               depVar="TempF", 
               keyVar="hrfct", 
               caption="Temp ~ f(locale, month, hour)"
               )

Accuracy by hour is more uniform, and the overall predictive power has increased.

Run times show the drawback of using random forest regression on such a simple problem:

# Extract the key data
lmData <- regrData %>%
    filter(year==2016, locNamefct %in% keyLocs3a)

# Extract the relevant testData
testData <- rfTempvMonthLocaleHour$testData

# Create trainData as lmData anti-joined to testData
trainData <- lmData %>%
    anti_join(select(testData, source, dtime))
## Joining, by = c("source", "dtime")
# Get the averages from train data
trainAvg <- 
    trainData %>%
    group_by(locNamefct, month, hrfct) %>%
    summarize(predMean=mean(TempF, na.rm=TRUE))

# Merge back to testData
testData <- testData %>%
    left_join(trainAvg)
## Joining, by = c("month", "locNamefct", "hrfct")
# Report on RMSE and R-squared
testData %>%
    summarize(varTot=var(TempF), varPred=var(predMean-TempF, na.rm=TRUE), 
              rsq=1-varPred/varTot, rmse=varPred**0.5
              )
## # A tibble: 1 x 4
##   varTot varPred   rsq  rmse
##    <dbl>   <dbl> <dbl> <dbl>
## 1   310.    47.8 0.846  6.92

The identical RMSE and R-squared are obtained in under a second. Note that running lm(TempF ~ locNamefct * month * hrfct, …) takes much longer than the random forest regression due to complexities related to using solve() and storing diagnostic statistics. Model selection should be driven by many factors. In this specific example, they all converge to predicting temperature as the mean for each grouping of factors.
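The claim that these approaches converge to group means can be illustrated in base R with toy data (invented for demonstration, not from the METAR files): a factor-only lm() produces fitted values identical to the per-group averages.

```r
# Toy demonstration: lm() on a factor predicts the group mean, the same
# quantity the grouped-average approach computes
df <- data.frame(grp=factor(c("a", "a", "b", "b", "b")),
                 temp=c(10, 20, 30, 40, 50))

fitLM   <- lm(temp ~ grp, data=df)          # saturated factor regression
grpPred <- ave(df$temp, df$grp, FUN=mean)   # per-row group mean

all.equal(as.numeric(fitted(fitLM)), grpPred)   # TRUE
```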

A run of the random forest regression is then made using only a few numeric variables - dew point, sea-level pressure, and altimeter:

# Run a regression for TempF vs. dew point, sea-level pressure, and altimeter
rfTempvDewSLPAltimeter <- rfRegression(regrData, 
                                       depVar="TempF", 
                                       predVars=c("DewF", "modSLP", "Altimeter"), 
                                       otherVar=otherVars,
                                       critFilter=list(year=2016, locNamefct=keyLocs3a), 
                                       seed=2007191311, 
                                       ntree=5, 
                                       mtry=3, 
                                       testSize=0.3
                                       )
# Evaluate simple regression using only dew point, sea-level pressure, and altimeter
rfOut_tvdsa <- evalRFRegression(rfTempvDewSLPAltimeter, 
                                depVar="TempF", 
                                keyVar="locNamefct", 
                                facetVar="locNamefct", 
                                caption="Temp ~ f(dew point, SLP, altimeter)",
                                returnData=TRUE
                                )

## Warning: Removed 2 rows containing missing values (position_stack).

# Impact on predictions by month
plotErrorByVar(rfTempvDewSLPAltimeter, 
               depVar="TempF", 
               keyVar="month", 
               caption="Temp ~ f(dew point, SLP, altimeter)"
               )

# Impact on predictions by hour
plotErrorByVar(rfTempvDewSLPAltimeter, 
               depVar="TempF", 
               keyVar="hrfct", 
               caption="Temp ~ f(dew point, SLP, altimeter)"
               )

Dew point and sea-level pressure stand out as more important than altimeter. Notably, unlike the factor-only regressions, the model continues to learn with more trees, suggesting that 5 is too few. Additionally, predictions are "worse than null" for San Diego and San Francisco, suggesting that the locale variable is vital for establishing the differing relationships of temperature to dew point across locales.

An additional version of the model is run taking the two highest importance variables from each of the previous regressions - locale, month, dew point, and sea-level pressure:

# Run a regression for TempF vs. month, locale, dew point, sea-level pressure
rfTempvDewSLPMonthLocale <- rfRegression(regrData, 
                                         depVar="TempF", 
                                         predVars=c("DewF", "modSLP", "month", "locNamefct"), 
                                         otherVar=otherVars,
                                         critFilter=list(year=2016, locNamefct=keyLocs3a), 
                                         seed=2007191321, 
                                         ntree=20, 
                                         mtry=2, 
                                         testSize=0.3
                                         )
# Evaluate regression using dew point, sea-level pressure, month, locale
rfOut_tvdsml <- evalRFRegression(rfTempvDewSLPMonthLocale, 
                                 depVar="TempF", 
                                 keyVar="locNamefct", 
                                 facetVar="locNamefct", 
                                 caption="Temp ~ f(dew point, SLP, month, locale)",
                                 returnData=TRUE
                                 )

# Impact on predictions by month
plotErrorByVar(rfTempvDewSLPMonthLocale, 
               depVar="TempF", 
               keyVar="month", 
               caption="Temp ~ f(dew point, SLP, month, locale)"
               )

# Impact on predictions by hour
plotErrorByVar(rfTempvDewSLPMonthLocale, 
               depVar="TempF", 
               keyVar="hrfct", 
               caption="Temp ~ f(dew point, SLP, month, locale)"
               )

Month is the most important variable, followed by locale and dew point. Sea-level pressure appears to have less influence. RMSE is down slightly and R-squared is up slightly, with predictions continuing to be hardest in Denver (high RMSE) and San Diego/San Francisco (low r-squared).

There continues to be a pattern where times in 18Z-24Z (roughly late afternoon) are predicted much worse than times in 2-8Z (roughly the first half of nighttime), potentially suggesting that including hour as an explanatory variable could help.

A version of the random forest regression is then run that adds hour and altimeter to the mix. The mtry variable is held at 2:

# Run a regression for TempF vs. six key variables
rfTempvBig6 <- rfRegression(regrData, 
                            depVar="TempF", 
                            predVars=c("DewF", "modSLP", "Altimeter", "month", "locNamefct", "hrfct"), 
                            otherVar=otherVars,
                            critFilter=list(year=2016, locNamefct=keyLocs3a), 
                            seed=2007191338, 
                            ntree=20, 
                            mtry=2, 
                            testSize=0.3
                            )
# Evaluate regression using six key variables
rfOut_tvbig6 <- evalRFRegression(rfTempvBig6, 
                                 depVar="TempF", 
                                 keyVar="locNamefct", 
                                 facetVar="locNamefct", 
                                 caption="Temp ~ f(dew point, SLP, altimeter, month, locale, hour)",
                                 returnData=TRUE
                                 )

# Impact on predictions by month
plotErrorByVar(rfTempvBig6, 
               depVar="TempF", 
               keyVar="month", 
               caption="Temp ~ f(dew point, SLP, altimeter, month, locale, hour)"
               )

# Impact on predictions by hour
plotErrorByVar(rfTempvBig6, 
               depVar="TempF", 
               keyVar="hrfct", 
               caption="Temp ~ f(dew point, SLP, altimeter, month, locale, hour)"
               )

While hour and altimeter show as lower on variable importance, overall model performance is significantly improved. Specifically, RMSE is down to ~4.5 and still falling with more trees, while r-squared is up to ~0.93 and still rising with more trees. Notably, many of the locales are now at 90%+ r-squared, with even San Diego and San Francisco now being accurately predicted.

Next steps are to focus specifically on some of the tougher locales (Denver, San Francisco, San Diego) and times (20Z for example) to explore what is driving those challenges.

Since locale and month are both known to be important, do the forests run more quickly if they are run for subsets of the data?

regrSubData <- regrData %>%
    filter(year==2016, locNamefct %in% keyLocs3a)

locsRegSub <- regrSubData %>%
    pull(locNamefct) %>%
    as.character() %>%
    unique() %>%
    sort()

Forests are then run for each locale:

startTime <- Sys.time()

# Create a container to hold the output
lstRegSub <- vector("list", length(locsRegSub))
names(lstRegSub) <- locsRegSub

n <- 1
for (loc in locsRegSub) {

    # Run a regression for TempF vs. month, hour, dew point, sea-level pressure, altimeter
    lstRegSub[[n]] <- rfRegression(regrSubData, 
                                   depVar="TempF", 
                                   predVars=c("DewF", "modSLP", "Altimeter", "month", "hrfct"), 
                                   otherVar=otherVars,
                                   critFilter=list(year=2016, locNamefct=loc), 
                                   seed=2007201355+n, 
                                   ntree=20, 
                                   mtry=2, 
                                   testSize=0.3
                                   )
    
    # Increment counter
    n <- n + 1
    
    # Update progress
    cat("\nFinished processing:", loc)
    
}
## 
## Finished processing: Atlanta, GA
## Finished processing: Boston, MA
## Finished processing: Chicago, IL
## Finished processing: Dallas, TX
## Finished processing: Denver, CO
## Finished processing: Las Vegas, NV
## Finished processing: Miami, FL
## Finished processing: New Orleans, LA
## Finished processing: Phoenix, AZ
## Finished processing: San Diego, CA
## Finished processing: San Francisco, CA
## Finished processing: Seattle, WA
endTime <- Sys.time()
cat("\nTotal run time:", round(endTime - startTime, 1), "seconds\n")
## 
## Total run time: 20.4 seconds

This process runs in around 20 seconds, roughly 100x faster than running the random forest with all the data together. In this case, locale is known to be strongly predictive, so splitting it out may make sense. Does the timing change if month is also split out?

startTime <- Sys.time()

# Create a container to hold the output
lstRegSub_002 <- vector("list", length(locsRegSub)*length(month.abb))

n <- 1
for (loc in locsRegSub) {
    for (mon in month.abb) {
        # Run a regression for TempF vs. hour, dew point, sea-level pressure, altimeter
        lstRegSub_002[[n]] <- rfRegression(regrSubData, 
                                           depVar="TempF", 
                                           predVars=c("DewF", "modSLP", "Altimeter", "hrfct"), 
                                           otherVar=otherVars,
                                           critFilter=list(year=2016, locNamefct=loc, month=mon), 
                                           seed=2007201405+n, 
                                           ntree=20, 
                                           mtry=2, 
                                           testSize=0.3
                                           )
    
        # Increment counter
        n <- n + 1
    }
    
    # Update progress
    cat("\nFinished processing:", loc)
    
}
## 
## Finished processing: Atlanta, GA
## Finished processing: Boston, MA
## Finished processing: Chicago, IL
## Finished processing: Dallas, TX
## Finished processing: Denver, CO
## Finished processing: Las Vegas, NV
## Finished processing: Miami, FL
## Finished processing: New Orleans, LA
## Finished processing: Phoenix, AZ
## Finished processing: San Diego, CA
## Finished processing: San Francisco, CA
## Finished processing: Seattle, WA
endTime <- Sys.time()
cat("\nTotal run time:", round(as.numeric(difftime(endTime, startTime, units="secs")), 1), "seconds\n")
## 
## Total run time: 25.3 seconds

There is no obvious improvement in run time, suggesting that the increased overhead of the for loops may be offsetting any efficiency gains from growing much simpler trees on smaller datasets.
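
If loop overhead is the bottleneck, one alternative is to split the data once up front rather than re-filtering the full frame on every iteration. This is a sketch only, and it assumes rfRegression() treats an omitted critFilter as "use all rows" (the actual signature lives in WeatherModelingFunctions_v001.R):

```r
library(dplyr)
library(purrr)

# Hypothetical sketch: pre-split the frame into locale-month slices once,
# so each rfRegression() call receives only its own slice
byLocMonth <- regrSubData %>%
    filter(year == 2016) %>%
    group_split(locNamefct, month)

# group_split() returns an unnamed list, so imap() supplies the integer
# index .y, used here to vary the seed as in the loops above
lstRegSub_alt <- imap(byLocMonth, 
                      ~rfRegression(.x, 
                                    depVar="TempF", 
                                    predVars=c("DewF", "modSLP", "Altimeter", "hrfct"), 
                                    otherVar=otherVars, 
                                    seed=2007201405 + .y, 
                                    ntree=20, 
                                    mtry=2, 
                                    testSize=0.3
                                    )
                      )
```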

The data from the locale regressions are combined for purposes of running diagnostics:

# Integrate the test data
subTestData <- map_dfr(lstRegSub, .f=function(x) { x$testData} )

# Integrate the r-squared data
subRSQ <- map_dfr(lstRegSub, 
                  .f=function(x) { tibble::tibble(ntree=1:length(x$rsq), rsq=x$rsq) }, 
                  .id="source"
                  )

# Integrate the RMSE data
subRMSE <- map_dfr(lstRegSub, 
                   .f=function(x) { tibble::tibble(ntree=1:length(x$mse), rmse=x$mse**0.5) }, 
                   .id="source"
                   )

# Plot integrated evolution of RSQ
helperPlotEvolution <- function(df, plotVar) {
    
    p1 <- df %>%
        ggplot(aes(x=ntree, y=.data[[plotVar]])) + 
        geom_line(aes(group=source, color=source)) + 
        labs(x="Number of trees", 
             title=paste0("Evolution of ", str_to_upper(plotVar), " by locale")
             )
    if (plotVar=="rsq") { p1 <- p1 + ylim(c(NA, 1)) }
    if (plotVar=="rmse") { p1 <- p1 + ylim(c(0, NA)) }
    print(p1)
    
}

# Plot final R-squared and RMSE by locale
ebvRFSub_loc <- plotErrorByVar(list(testData=subTestData), 
                               depVar="TempF", 
                               keyVar="locNamefct", 
                               returnData=TRUE
                               )

# Plot final R-squared and RMSE by month
ebvRFSub_mon <- plotErrorByVar(list(testData=subTestData), 
                               depVar="TempF", 
                               keyVar="month", 
                               returnData=TRUE
                               )

# Plot final R-squared and RMSE by hour
ebvRFSub_hr <- plotErrorByVar(list(testData=subTestData), 
                              depVar="TempF", 
                              keyVar="hrfct", 
                              returnData=TRUE
                              )

# Show differences by model run
ebvData <- bind_rows(ebvRFSub_loc, rfOut_tvbig6$f3, .id="source") %>%
    mutate(source=ifelse(source==1, "Run by locale", "Run overall"))

# Plot differences in R-squared
ebvData %>%
    ggplot(aes(x=fct_reorder(locNamefct, rsq), y=rsq)) + 
    geom_point(aes(color=source)) + 
    coord_flip() + 
    ylim(c(NA, 1)) + 
    labs(x="", y="R-squared", title="R-squared by modeling approach") + 
    scale_color_discrete("Model")

# Plot differences in RMSE
ebvData %>%
    ggplot(aes(x=fct_reorder(locNamefct, rmse), y=rmse)) + 
    geom_point(aes(color=source)) + 
    coord_flip() + 
    ylim(c(0, NA)) + 
    labs(x="", y="RMSE", title="RMSE by modeling approach") + 
    scale_color_discrete("Model")

If anything, the models run by locale appear to be slightly more accurate than the models run overall. The differences in variable importance can also be assessed:

# Grab the variable importance for each locale
impLstRegSub <- map_dfr(lstRegSub, 
                        .f=function(x) { x$rfModel$importance %>% t() %>% tibble::as_tibble() }, 
                        .id="locNamefct"
                        )

# Get the percentage by variable, then plot
impLstRegSub %>%
    pivot_longer(-locNamefct, names_to="variable", values_to="importance") %>%
    group_by(locNamefct) %>%
    mutate(pct=importance/sum(importance)) %>%
    ungroup() %>%
    ggplot(aes(x=locNamefct, y=variable)) + 
    geom_tile(aes(fill=pct)) + 
    geom_text(aes(label=round(pct, 3))) + 
    coord_flip() + 
    labs(x="", y="", title="Variable importance by locale (scaled to add to 100% in each locale)") + 
    scale_fill_continuous("Percent", low="white", high="green")

There are some interesting differences in variable importance by locale:

  • Las Vegas and Phoenix make very little use of dew point; Denver, San Diego, and San Francisco also assign it far less importance
  • Las Vegas, Phoenix, and Denver rely primarily on month, with some weight on sea-level pressure and hour
  • San Diego and San Francisco use a rather balanced weighting of month, hour, and dew point

Part of the difference is driven by seasonal extremes. For example, Chicago has four distinct seasons, with dew point correlated to temperature in each. Thus, there is a lot of variance in Chicago temperature data, and much of that variance can be explained by dew point. In contrast, Las Vegas tends to have low dew points all year, and the marine locales tend to have moderate dew points all year, so dew point carries little explanatory power in those locales.
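
This seasonal argument can be quantified directly from the data already loaded; a quick sketch of the temperature/dew-point relationship by locale:

```r
library(dplyr)

# Correlation of temperature and dew point by locale for 2016; locales where
# the forests rank dew point as important should tend to show the higher
# correlations and larger overall temperature spreads
regrSubData %>%
    filter(year == 2016, !is.na(TempF), !is.na(DewF)) %>%
    group_by(locNamefct) %>%
    summarize(corTempDew = cor(TempF, DewF), sdTemp = sd(TempF)) %>%
    arrange(desc(corTempDew))
```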

Another area to examine is the explanatory power for random forest regressions by a mix of two factor variables. A helper function is written to explore this - helperAccuracyByFactors() is available in file WeatherModelingFunctions_v001.R:

helperAccuracyByFactors(subTestData, xVar="locNamefct", yVar="month", depVar="TempF", metric="rsq")

helperAccuracyByFactors(subTestData, xVar="locNamefct", yVar="month", depVar="TempF", metric="rmse")
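
The actual helper is defined in WeatherModelingFunctions_v001.R; as a rough sketch of the kind of computation involved (an assumption, not the file's implementation), the metric for each cell of the two-factor grid can be built with a grouped summary over the test data's actual and predicted values:

```r
library(dplyr)

# Hypothetical sketch: group by the two factor variables, then compute
# R-squared (1 - SSE/SST, with SST taken around each cell's own mean) or
# RMSE within each cell of the grid
sketchAccuracyByFactors <- function(df, xVar, yVar, depVar, metric = "rsq") {
    df %>%
        group_by(across(all_of(c(xVar, yVar)))) %>%
        summarize(
            rsq  = 1 - sum((.data[[depVar]] - predicted)^2) /
                       sum((.data[[depVar]] - mean(.data[[depVar]]))^2),
            rmse = sqrt(mean((.data[[depVar]] - predicted)^2)),
            .groups = "drop"
        ) %>%
        select(all_of(c(xVar, yVar)), all_of(metric))
}
```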

Explanatory power is rather low for some of these intersections. Much of each model's overall explanatory power by locale comes from capturing differences in average temperature by month, and several of the R-squared values suggest that little more is being learned about within-month differences. Boston in July stands out - what does the explanatory power by hour look like? Hours are grouped in threes to minimize buckets with very little data:

subJuly <- subTestData %>%
    filter(month=="Jul") %>%
    mutate(hrBucket=factor(3 * (((as.integer(hrfct)-1) %% 24) %/% 3) + 1))

helperAccuracyByFactors(subJuly, xVar="locNamefct", yVar="hrBucket", depVar="TempF", metric="rsq")

There is clearly significant room for improvement: several of the R-squared values are negative, meaning the model's squared error in those cells exceeds the variance around the cell's own mean. Suppose that random forest regressions are run for each locale, using only July data:

startTime <- Sys.time()

# Create a container to hold the output
lstRegSub_jul <- vector("list", length(locsRegSub))

n <- 1
for (loc in locsRegSub) {
    for (mon in month.abb[7]) {
        # Run a regression for TempF vs. hour, dew point, sea-level pressure, altimeter
        lstRegSub_jul[[n]] <- rfRegression(regrSubData, 
                                           depVar="TempF", 
                                           predVars=c("DewF", "modSLP", "Altimeter", "hrfct"), 
                                           otherVar=otherVars,
                                           critFilter=list(year=2016, locNamefct=loc, month=mon), 
                                           seed=2007211342+n, 
                                           ntree=100, 
                                           mtry=3, 
                                           testSize=0.3
                                           )
    
        # Increment counter
        n <- n + 1
    }
    
    # Update progress
    cat("\nFinished processing:", loc)
    
}
## 
## Finished processing: Atlanta, GA
## Finished processing: Boston, MA
## Finished processing: Chicago, IL
## Finished processing: Dallas, TX
## Finished processing: Denver, CO
## Finished processing: Las Vegas, NV
## Finished processing: Miami, FL
## Finished processing: New Orleans, LA
## Finished processing: Phoenix, AZ
## Finished processing: San Diego, CA
## Finished processing: San Francisco, CA
## Finished processing: Seattle, WA
endTime <- Sys.time()
cat("\nTotal run time:", round(as.numeric(difftime(endTime, startTime, units="secs")), 1), "seconds\n")
## 
## Total run time: 8.2 seconds

The evaluation process can then be run again:

# Integrate the test data
subTestData_jul <- map_dfr(lstRegSub_jul, .f=function(x) { x$testData} )

# Plot final R-squared and RMSE by locale
ebvRFSub_loc_jul <- plotErrorByVar(list(testData=subTestData_jul), 
                                   depVar="TempF", 
                                   keyVar="locNamefct", 
                                   returnData=TRUE
                                   )

# Plot final R-squared and RMSE by hour
ebvRFSub_hr_jul <- plotErrorByVar(list(testData=subTestData_jul), 
                                  depVar="TempF", 
                                  keyVar="hrfct", 
                                  returnData=TRUE
                                  )

subJuly_002 <- subTestData_jul %>%
    filter(month=="Jul") %>%
    mutate(hrBucket=factor(3 * (((as.integer(hrfct)-1) %% 24) %/% 3) + 1))

helperAccuracyByFactors(subJuly_002, xVar="locNamefct", yVar="hrBucket", depVar="TempF", metric="rsq")

Boston remains the hardest to predict in July, and there has been only a little improvement in RMSE or R-squared. Why is the Dallas data still showing negative R-squared?

subPlotJuly <- subJuly_002 %>%
    filter(locNamefct=="Dallas, TX", hrBucket==7)

subPlotJuly
## # A tibble: 33 x 12
##    TempF source dtime                DewF month locNamefct hrfct modSLP
##    <dbl> <chr>  <dttm>              <dbl> <fct> <fct>      <fct>  <dbl>
##  1  80.1 kdfw_~ 2016-07-01 06:53:00  66.0 Jul   Dallas, TX 7      1014.
##  2  82.0 kdfw_~ 2016-07-04 06:53:00  75.0 Jul   Dallas, TX 7      1010.
##  3  82.0 kdfw_~ 2016-07-06 07:53:00  72.0 Jul   Dallas, TX 8      1011.
##  4  82.9 kdfw_~ 2016-07-11 06:53:00  70.0 Jul   Dallas, TX 7      1010.
##  5  84.0 kdfw_~ 2016-07-12 05:53:00  66.9 Jul   Dallas, TX 6      1010.
##  6  81.0 kdfw_~ 2016-07-12 07:53:00  70.0 Jul   Dallas, TX 8      1010.
##  7  84.9 kdfw_~ 2016-07-13 05:53:00  70.0 Jul   Dallas, TX 6      1012.
##  8  84.0 kdfw_~ 2016-07-13 06:53:00  71.1 Jul   Dallas, TX 7      1012.
##  9  84.0 kdfw_~ 2016-07-14 07:53:00  71.1 Jul   Dallas, TX 8      1013.
## 10  82.0 kdfw_~ 2016-07-16 05:53:00  68   Jul   Dallas, TX 6      1014.
## # ... with 23 more rows, and 4 more variables: Altimeter <dbl>,
## #   predicted <dbl>, correct <lgl>, hrBucket <fct>
subPlotJuly %>%
    ggplot(aes(x=TempF, y=predicted)) + 
    geom_point() + 
    geom_abline(lty=2, color="red") + 
    labs(x="Actual Temperature", y="Predicted Temperature") + 
    geom_text(data=~filter(., TempF < 78), aes(x=TempF+0.25, label=dtime), hjust=0, color="red") +
    coord_equal()

The model appears to be reasonable in this case. There is an anomalously low temperature of 75 degrees at 6Z on July 10, and the model predicts a value roughly in line with what would typically be expected for Dallas in July around midnight. A plot of Dallas temperatures and dew points in July suggests that while low temperatures in the mid-70s are rare, they do occur several times:

regrData %>% 
    filter(source=="kdfw_2016", month=="Jul", !is.na(TempF), !is.na(DewF)) %>% 
    ggplot(aes(x=dtime, y=TempF)) + 
    geom_line(color="red") + 
    geom_line(aes(y=DewF), color="blue") + 
    labs(x="", y="Temp/Dew (F)", title="Dallas, TX in July 2016")

The dew point does not help predict these low-temperature evenings. Suppose 76.5 degrees is used as the cutoff for a “cold” July occurrence in Dallas. Does anything stand out about these times?

dfw_jul <- regrData %>% 
    filter(source=="kdfw_2016", month=="Jul", !is.na(TempF), !is.na(DewF)) %>% 
    mutate(isCold=TempF <= 76.5) 

dfw_jul %>% 
    select(-year, -monthint, -day, -starts_with("cL"), -starts_with("orig"), -hr) %>%
    group_by(isCold) %>%
    summarize_if(is.numeric, .funs=mean, na.rm=TRUE)
## # A tibble: 2 x 13
##   isCold WindSpeed WindGust Visibility Altimeter TempF  DewF modSLP p1Inches
##   <lgl>      <dbl>    <dbl>      <dbl>     <dbl> <dbl> <dbl>  <dbl>    <dbl>
## 1 FALSE      10.2      22.3      10.0       30.0  87.7  69.9  1013.   0.0715
## 2 TRUE        7.35     30         8.96      30.0  74.1  67.8  1014.   0.284 
## # ... with 4 more variables: p36Inches <dbl>, p24Inches <dbl>, tempFHi <dbl>,
## #   tempFLo <dbl>
dfw_jul %>% 
    select(-year, -monthint, -day, -starts_with("cL"), -starts_with("orig"), -hr) %>%
    group_by(isCold) %>%
    summarize_if(is.logical, .funs=mean, na.rm=TRUE)
## # A tibble: 2 x 4
##   isCold  isRain isSnow isThunder
##   <lgl>    <dbl>  <dbl>     <dbl>
## 1 FALSE  0.00565      0    0.0141
## 2 TRUE   0.304        0    0.261
dfw_jul %>% 
    count(isCold, predomDir) %>%
    group_by(isCold) %>%
    mutate(pct=n/sum(n)) %>%
    pivot_wider(predomDir, names_from="isCold", values_from="pct")
## # A tibble: 10 x 3
##    predomDir `FALSE`  `TRUE`
##    <fct>       <dbl>   <dbl>
##  1 000       0.0141   0.0870
##  2 VRB       0.0141  NA     
##  3 NE        0.0155   0.0870
##  4 E         0.0325   0.0435
##  5 SE        0.177    0.304 
##  6 S         0.653    0.348 
##  7 SW        0.0749   0.0435
##  8 W         0.00847 NA     
##  9 NW        0.00282 NA     
## 10 N         0.00847  0.0870
dfw_jul %>% 
    count(isCold, minHeight) %>%
    group_by(isCold) %>%
    mutate(pct=n/sum(n)) %>%
    pivot_wider(minHeight, names_from="isCold", values_from="pct")
## # A tibble: 5 x 3
##   minHeight `FALSE` `TRUE`
##   <fct>       <dbl>  <dbl>
## 1 Surface   0.00424 0.217 
## 2 Low       0.0918  0.174 
## 3 Medium    0.288   0.0870
## 4 High      0.145   0.391 
## 5 None      0.470   0.130
dfw_jul %>% 
    count(isCold, ceilingHeight) %>%
    group_by(isCold) %>%
    mutate(pct=n/sum(n)) %>%
    pivot_wider(ceilingHeight, names_from="isCold", values_from="pct")
## # A tibble: 4 x 3
##   ceilingHeight `FALSE`  `TRUE`
##   <fct>           <dbl>   <dbl>
## 1 Low            0.0127 NA     
## 2 Medium         0.0113  0.0870
## 3 High           0.0297  0.217 
## 4 None           0.946   0.696

It appears that a “cold” July evening in Dallas is associated with precipitation, clouds, and winds from the east. Does this match up with data from July 10?

metarData %>%
    filter(source=="kdfw_2016", month=="Jul", lubridate::day(dtime)==10, lubridate::hour(dtime) < 12) %>%
    pull(origMETAR)
##  [1] "KDFW 100053Z 17006KT 10SM -RA FEW035 FEW060 BKN150 BKN250 24/21 A2994 RMK AO2 PRESFR SLP130 OCNL LTGIC DSNT S CB DSNT S MOV SE P0001 T02440211 $"
##  [2] "KDFW 100153Z 06007KT 10SM FEW060 SCT110 BKN200 25/22 A2997 RMK AO2 RAE32 SLP141 OCNL LTGIC DSNT S CB DSNT S MOV SE P0000 T02500217 $"            
##  [3] "KDFW 100253Z 12003KT 10SM FEW060 FEW110 BKN210 25/20 A2994 RMK AO2 SLP130 OCNL LTGIC DSNT S CB DSNT S MOV SE 60001 T02500200 56023 $"            
##  [4] "KDFW 100353Z 15003KT 10SM FEW120 BKN250 OVC300 24/20 A2996 RMK AO2 SLP136 CB DSNT S MOVD SE T02440200 $"                                         
##  [5] "KDFW 100453Z 12005KT 10SM FEW120 SCT250 BKN300 24/20 A2998 RMK AO2 SLP143 T02440200 $"                                                           
##  [6] "KDFW 100553Z 12006KT 10SM FEW120 SCT250 BKN300 24/20 A2998 RMK AO2 SLP143 60001 T02390200 10256 20239 403500228 51014 $"                         
##  [7] "KDFW 100653Z 18003KT 10SM FEW100 SCT250 SCT300 24/19 A2998 RMK AO2 SLP142 T02440194 $"                                                           
##  [8] "KDFW 100753Z 18003KT 10SM FEW120 SCT300 23/19 A2996 RMK AO2 SLP136 T02330194 $"                                                                  
##  [9] "KDFW 100853Z 00000KT 10SM FEW250 SCT300 23/20 A2995 RMK AO2 SLP131 T02280200 58012 $"                                                            
## [10] "KDFW 100953Z 11004KT 10SM FEW300 23/20 A2995 RMK AO2 SLP134 T02280200 $"                                                                         
## [11] "KDFW 101053Z 15006KT 10SM FEW110 SCT300 22/20 A2995 RMK AO2 SLP135 T02220200 $"                                                                  
## [12] "KDFW 101153Z 16004KT 10SM FEW110 FEW300 23/21 A2996 RMK AO2 SLP138 70003 T02280206 10244 20222 53005 $"

To a degree, yes. There were clouds and rain earlier in the evening, though these had cleared out by the time of the anomaly. This is potentially an area for further exploration.
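
For readers less familiar with the format, the key fields of the first (0053Z) report above decode as follows under standard METAR conventions (annotations added here):

```r
# Decoding "KDFW 100053Z 17006KT 10SM -RA FEW035 ... 24/21 A2994 RMK ...":
# "KDFW 100053Z"      - station KDFW, day 10 of the month, 00:53 UTC
# "17006KT"           - wind from 170 degrees at 6 knots
# "10SM"              - visibility 10 statute miles
# "-RA"               - light rain
# "FEW035 ... BKN250" - cloud layers, few at 3,500 ft up to broken at 25,000 ft
# "24/21"             - temperature 24 C, dew point 21 C
# "A2994"             - altimeter 29.94 inches of mercury
# "PRESFR"            - remark: pressure falling rapidly
```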

Next, the evolution of RMSE as variables are added is plotted. The main categorical variables - locale, month, and hour - are added first, as predictions from these amount to averages for each intersection of the variables:

# Sample test/train dataset
set.seed(2007221340)
nullData <- regrData %>%
    filter(!is.na(TempF), year==2016, TempF <= 120, DewF <= 100, TempF >= DewF) %>%
    mutate(isTest=runif(nrow(.))<=0.3)

# Null estimates for temperature based on locale; locale-month; and locale-month-hour
nullMeans <- nullData %>%
    filter(!isTest) %>%
    group_by(locNamefct) %>%
    mutate(meanLoc=mean(TempF)) %>%
    group_by(locNamefct, month) %>%
    mutate(meanLocMonth=mean(TempF)) %>%
    group_by(locNamefct, month, hrfct) %>%
    summarize(meanLoc=mean(meanLoc), meanLocMonth=mean(meanLocMonth), meanLocMonthHour=mean(TempF), n=n()) %>%
    ungroup()

# Assessment of accuracies for temperature by locale
nullTest <- nullData %>%
    filter(isTest) %>%
    right_join(nullMeans, by=c("locNamefct", "month", "hrfct")) %>%
    mutate(eLoc=meanLoc-TempF, 
           eLocMonth=meanLocMonth-TempF, 
           eLocMonthHour=meanLocMonthHour-TempF
           )

# Create overall metrics
nullOverall <- nullTest %>%
    mutate(locNamefct="Overall") %>%
    group_by(locNamefct) %>%
    summarize(varTot=var(TempF), 
              eLoc2=mean(eLoc**2), 
              eLocMonth=mean(eLocMonth**2), 
              eLocMonthHour=mean(eLocMonthHour**2)
              )

# Create locale metrics, and bind overall metrics
nullByLocale <- nullTest %>%
    mutate(locNamefct=as.character(locNamefct)) %>%
    group_by(locNamefct) %>%
    summarize(varTot=var(TempF), 
              eLoc2=mean(eLoc**2), 
              eLocMonth=mean(eLocMonth**2), 
              eLocMonthHour=mean(eLocMonthHour**2)
              ) %>%
    bind_rows(nullOverall)

# Create plot for RMSE
nullByLocale %>%
    pivot_longer(-locNamefct, names_to="byNull", values_to="MSE") %>%
    mutate(RMSE=MSE**0.5, 
           byNull=factor(byNull, levels=c("varTot", "eLoc2", "eLocMonth", "eLocMonthHour"))
           )%>%
    group_by(locNamefct) %>% 
    mutate(dRMSE=ifelse(row_number()==n(), RMSE, RMSE-lead(RMSE))) %>%
    ungroup() %>%
    ggplot(aes(x=fct_reorder(locNamefct, RMSE), y=dRMSE, fill=byNull)) + 
    geom_col(position="stack") + 
    coord_flip() + 
    scale_fill_discrete("Technique", 
                        breaks=c("varTot", "eLoc2", "eLocMonth", "eLocMonthHour"), 
                        labels=c("Overall", "Locale", "Locale-Month", "Locale-Month-Hour"), 
                        guide=guide_legend(reverse=TRUE)
                        ) + 
    labs(x="", y="RMSE", title="RMSE by Locale by Technique (averages only)") + 
    theme(legend.position="bottom")

For the overall prediction, improvements in RMSE are most rapid when moving from locale-only to locale-month. This pattern appears to be broadly repeated across most locales.

How much explanatory power do dew point and sea-level pressure have after controlling for locale-month-hour?

# Addition of dew point to training data
nullTrain <- nullData %>%
    filter(!isTest) %>%
    right_join(nullMeans, by=c("locNamefct", "month", "hrfct")) %>%
    mutate(err=meanLocMonthHour-TempF)

# Simple linear regression for dew point vs. error
lm(err ~ DewF, data=nullTrain) %>% 
    summary()
## 
## Call:
## lm(formula = err ~ DewF, data = nullTrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.077  -4.146   0.557   4.296  38.408 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.8913389  0.0448067   131.5   <2e-16 ***
## DewF        -0.1272657  0.0008986  -141.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.124 on 182948 degrees of freedom
## Multiple R-squared:  0.09881,    Adjusted R-squared:  0.09881 
## F-statistic: 2.006e+04 on 1 and 182948 DF,  p-value: < 2.2e-16
# Simple linear regression for modSLP vs. error
lm(err ~ modSLP, data=nullTrain) %>% 
    summary()
## 
## Call:
## lm(formula = err ~ modSLP, data = nullTrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.254  -4.161  -0.217   3.963  43.079 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.892e+02  2.543e+00  -153.1   <2e-16 ***
## modSLP       3.829e-01  2.502e-03   153.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.065 on 182948 degrees of freedom
## Multiple R-squared:  0.1135, Adjusted R-squared:  0.1135 
## F-statistic: 2.343e+04 on 1 and 182948 DF,  p-value: < 2.2e-16
# Simple linear regression for dew point and modSLP vs. error
lm(err ~ DewF + modSLP, data=nullTrain) %>% 
    summary()
## 
## Call:
## lm(formula = err ~ DewF + modSLP, data = nullTrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.274  -4.009   0.352   4.092  39.718 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.177e+02  2.535e+00  -125.3   <2e-16 ***
## DewF        -1.011e-01  8.851e-04  -114.2   <2e-16 ***
## modSLP       3.172e-01  2.485e-03   127.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.826 on 182947 degrees of freedom
## Multiple R-squared:  0.1725, Adjusted R-squared:  0.1725 
## F-statistic: 1.907e+04 on 2 and 182947 DF,  p-value: < 2.2e-16

At a glance, dew point and sea-level pressure both have meaningful explanatory power. In combination, they explain 17% of the variance in error when predictions are made solely based on locale-month-hour.

Does this change when regressions are run on smaller slices of the data (locale-month)?

nullLM <- nullTrain %>%
    nest(data=-c(locNamefct, month)) %>%
    mutate(lmResult=map(data, ~lm(err ~ DewF + modSLP, data=.x)), 
           tidied=map(lmResult, broom::glance)
           ) %>%
    unnest(tidied)

nullLM %>%
    ggplot(aes(x=r.squared)) +
    geom_histogram(bins=10) + 
    labs(x="R-squared", 
         y="# Combinations of Locale-Month", 
         title="R-squared for error ~ dewpoint + sea-level pressure", 
         subtitle="Error based on null model of temperature ~ locale + month + hour"
         )

nullLM %>%
    ggplot(aes(x=fct_reorder(locNamefct, r.squared), y=r.squared)) + 
    geom_boxplot(fill="lightblue") + 
    coord_flip() + 
    labs(x="", 
         y="R-squared", 
         title="R-squared for error ~ dewpoint + sea-level pressure (by locale/month)", 
         subtitle="Error based on null model of temperature ~ locale + month + hour"
         )

nullLM %>%
    ggplot(aes(x=month, y=r.squared)) + 
    geom_boxplot(fill="lightblue") + 
    coord_flip() + 
    labs(x="", 
         y="R-squared", 
         title="R-squared for error ~ dewpoint + sea-level pressure (by locale/month)", 
         subtitle="Error based on null model of temperature ~ locale + month + hour"
         )

Adding dew point and sea-level pressure appears to have better explanatory power in cold-weather locales and in cold-weather months. Explanatory power appears to be lowest in marine and desert locales, and in warm-weather months. In part, this is because there is less overall variation of temperature in these locales, so knowing the locale, month, and hour already drives a very low RMSE. In contrast, RMSE tended to be higher for the cold-weather locales.
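
That difference in overall variation can be checked directly against the null-model results; a quick sketch comparing total temperature spread to the residual RMSE of the locale-month-hour null model:

```r
library(dplyr)

# Total temperature spread vs. residual RMSE of the locale-month-hour null
# model, by locale; locales with a small sdTemp leave the regression on
# dew point and sea-level pressure little room to improve upon the null
nullTest %>%
    group_by(locNamefct) %>%
    summarize(sdTemp = sd(TempF), 
              rmseNull = sqrt(mean(eLocMonthHour^2))) %>%
    arrange(desc(sdTemp))
```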

The results of the linear regression can then be used to make a new prediction for temperature as follows:

  • There is an original, true temperature, To
  • There is a predicted temperature, Tp, based on locale-month-hour
  • There is a prediction error, Ep, calculated as Tp-To
  • There is an lm model calculating Ep ~ DewF + modSLP
  • Use the lm model to predict a new error, En, for each data element
  • Predict a new temperature, Tn, as Tn=Tp-En

# Merge the lm column to the testing data
nullTestLM <- nullTest %>%
    group_by(locNamefct, month) %>%
    nest(data=-c(locNamefct, month)) %>%
    ungroup() %>%
    left_join(select(nullLM, month, locNamefct, lmResult), by=c("month", "locNamefct"))

# Create new predictions in the nested frame
nullTestLM <- nullTestLM %>%
    mutate(data=map2(.x=data, 
                     .y=lmResult, 
                     ~mutate(.x, errPred=predict(.y, newdata=.x), tempPred=meanLocMonthHour-errPred)
                     )
           )

# Unnest for the actual data
lmNullDewSLP <- nullTestLM %>%
    unnest(data)

The improvements in RMSE can then be calculated and plotted:

newOverall <- lmNullDewSLP %>% 
    mutate(locNamefct="Overall") %>% 
    group_by(locNamefct) %>% 
    summarize(eDewSLP=mean((TempF-tempPred)**2))

newByLocale <- lmNullDewSLP %>% 
    mutate(locNamefct=as.character(locNamefct)) %>% 
    group_by(locNamefct) %>% 
    summarize(eDewSLP=mean((TempF-tempPred)**2)) %>%
    bind_rows(newOverall) %>%
    right_join(nullByLocale, by=c("locNamefct"))

# Create plot for RMSE
newByLocale %>%
    select(-eDewSLP, eDewSLP) %>%
    pivot_longer(-locNamefct, names_to="byNull", values_to="MSE") %>%
    mutate(RMSE=MSE**0.5, 
           byNull=factor(byNull, levels=c("varTot", "eLoc2", "eLocMonth", "eLocMonthHour", "eDewSLP"))
           ) %>%
    group_by(locNamefct) %>% 
    mutate(dRMSE=ifelse(row_number()==n(), RMSE, RMSE-lead(RMSE))) %>%
    ungroup() %>%
    ggplot(aes(x=fct_reorder(locNamefct, RMSE, .fun=max), 
               y=dRMSE, 
               fill=byNull
               )
           ) + 
    geom_col(position="stack") + 
    coord_flip() + 
    scale_fill_discrete("Technique", 
                        breaks=c("varTot", "eLoc2", "eLocMonth", "eLocMonthHour", "eDewSLP"), 
                        labels=c("Impact of Locale", "Impact of Month", "Impact of Hour", 
                                 "Impact of Dew and SLP", "Final"
                                 ), 
                        guide=guide_legend(reverse=TRUE)
                        ) + 
    labs(x="", y="RMSE", title="RMSE by Locale by Technique (averages, then lm for Dew/SLP)") + 
    theme(legend.position="bottom")

The final RMSE values by locale are much more even than the originals, driven by the significant impact of (especially) month and dew point/SLP in many of the trickier locales. Is there a similar finding when running random forests by locale?

startTime <- Sys.time()

# Create a container to hold the output
locsFull2016 <- newByLocale %>% 
    filter(locNamefct != "Overall") %>%
    pull(locNamefct)
lstFull2016mhxx <- vector("list", length(locsFull2016))
names(lstFull2016mhxx) <- locsFull2016

# Filter so that TempF exists and that bizarre outliers cannot occur
regrUse2016 <- regrData %>%
    filter(!is.na(TempF), year==2016, TempF <= 120, DewF <= 100, TempF >= DewF)

n <- 1
for (loc in locsFull2016) {

    # Run a regression for TempF vs. month, hour
    lstFull2016mhxx[[n]] <- rfRegression(regrUse2016, 
                                         depVar="TempF", 
                                         predVars=c("month", "hrfct"), 
                                         otherVar=otherVars,
                                         critFilter=list(year=2016, locNamefct=loc), 
                                         seed=2007231345+n, 
                                         ntree=20, 
                                         mtry=2, 
                                         testSize=0.3
                                         )
    
    # Increment counter
    n <- n + 1
    
    # Update progress
    cat("\nFinished processing:", loc)
    
}
## 
## Finished processing: Atlanta, GA
## Finished processing: Boston, MA
## Finished processing: Chicago, IL
## Finished processing: Dallas, TX
## Finished processing: Denver, CO
## Finished processing: Detroit, MI
## Finished processing: Grand Rapids, MI
## Finished processing: Green Bay, WI
## Finished processing: Houston, TX
## Finished processing: Indianapolis, IN
## Finished processing: Las Vegas, NV
## Finished processing: Lincoln, NE
## Finished processing: Los Angeles, CA
## Finished processing: Madison, WI
## Finished processing: Miami, FL
## Finished processing: Milwaukee, WI
## Finished processing: Minneapolis, MN
## Finished processing: New Orleans, LA
## Finished processing: Newark, NJ
## Finished processing: Philadelphia, PA
## Finished processing: Phoenix, AZ
## Finished processing: Saint Louis, MO
## Finished processing: San Antonio, TX
## Finished processing: San Diego, CA
## Finished processing: San Francisco, CA
## Finished processing: San Jose, CA
## Finished processing: Seattle, WA
## Finished processing: Tampa Bay, FL
## Finished processing: Traverse City, MI
## Finished processing: Washington, DC
endTime <- Sys.time()
cat("\nTotal run time:", round(as.numeric(difftime(endTime, startTime, units="secs")), 1), "seconds\n")
## 
## Total run time: 40.9 seconds

The MSE can then be calculated from each model's testData element, with comparisons made to the null model:

mseFull2016mhxx <- tibble::tibble(locNamefct=names(lstFull2016mhxx), 
                                  mse=sapply(lstFull2016mhxx, 
                                             FUN=function(x) {
                                                 mean((x$testData$TempF-x$testData$predicted)**2)
                                                 }
                                             )
                                  )

newByLocale %>% 
    inner_join(mseFull2016mhxx, by="locNamefct") %>% 
    mutate(delta=mse**0.5-eLocMonthHour**0.5) %>% 
    ggplot(aes(x=fct_reorder(locNamefct, -delta), y=delta)) +
    geom_col(fill="lightblue") + 
    coord_flip() + 
    labs(x="", 
         y="Change in RMSE (random forest RMSE minus null RMSE)", 
         title="Impact on RMSE of random forest vs. null model"
         )

Broadly speaking, the random forest regression and the null model achieve very similar RMSE. Some locales are predicted ~0.25 degrees better while others are predicted ~0.25 degrees worse. Does this pattern hold when adding dew point and sea-level pressure to the random forest regression?

startTime <- Sys.time()

# Create a container to hold the output
locsFull2016 <- newByLocale %>% 
    filter(locNamefct != "Overall") %>%
    pull(locNamefct)
lstFull2016mhds <- vector("list", length(locsFull2016))
names(lstFull2016mhds) <- locsFull2016

# Filter so that TempF exists and that bizarre outliers cannot occur
regrUse2016 <- regrData %>%
    filter(!is.na(TempF), year==2016, TempF <= 120, DewF <= 100, TempF >= DewF)

n <- 1
for (loc in locsFull2016) {

    # Run a regression for TempF vs. month, hour
    lstFull2016mhds[[n]] <- rfRegression(regrUse2016, 
                                         depVar="TempF", 
                                         predVars=c("month", "hrfct", "DewF", "modSLP"), 
                                         otherVar=otherVars,
                                         critFilter=list(year=2016, locNamefct=loc), 
                                         seed=2007231345+n, 
                                         ntree=20, 
                                         mtry=3, 
                                         testSize=0.3
                                         )
    
    # Increment counter
    n <- n + 1
    
    # Update progress
    cat("\nFinished processing:", loc)
    
}
## 
## Finished processing: Atlanta, GA
## Finished processing: Boston, MA
## Finished processing: Chicago, IL
## Finished processing: Dallas, TX
## Finished processing: Denver, CO
## Finished processing: Detroit, MI
## Finished processing: Grand Rapids, MI
## Finished processing: Green Bay, WI
## Finished processing: Houston, TX
## Finished processing: Indianapolis, IN
## Finished processing: Las Vegas, NV
## Finished processing: Lincoln, NE
## Finished processing: Los Angeles, CA
## Finished processing: Madison, WI
## Finished processing: Miami, FL
## Finished processing: Milwaukee, WI
## Finished processing: Minneapolis, MN
## Finished processing: New Orleans, LA
## Finished processing: Newark, NJ
## Finished processing: Philadelphia, PA
## Finished processing: Phoenix, AZ
## Finished processing: Saint Louis, MO
## Finished processing: San Antonio, TX
## Finished processing: San Diego, CA
## Finished processing: San Francisco, CA
## Finished processing: San Jose, CA
## Finished processing: Seattle, WA
## Finished processing: Tampa Bay, FL
## Finished processing: Traverse City, MI
## Finished processing: Washington, DC
endTime <- Sys.time()
cat("\nTotal run time:", round(as.numeric(difftime(endTime, startTime, units="secs")), 1), "seconds\n")
## 
## Total run time: 113.9 seconds

The MSE can then be calculated from each of the testData files and compared to the earlier approach, which combined null means with a linear model of dewpoint and SLP on the residuals:

mseFull2016mhds <- tibble::tibble(locNamefct=names(lstFull2016mhds), 
                                  mse=sapply(lstFull2016mhds, 
                                             FUN=function(x) {
                                                 mean((x$testData$TempF-x$testData$predicted)**2)
                                                 }
                                             )
                                  )

newByLocale %>% 
    inner_join(mseFull2016mhds, by="locNamefct") %>% 
    mutate(delta=mse**0.5-eDewSLP**0.5) %>% 
    ggplot(aes(x=fct_reorder(locNamefct, -delta), y=delta)) +
    geom_col(fill="lightblue") + 
    coord_flip() + 
    labs(x="", 
         y="Change in RMSE (random forest RMSE minus null+linear RMSE)", 
         title="Impact on RMSE of random forest vs. null+linear model"
         )

sapply(lstFull2016mhds, FUN=function(x) { x$rsq }) %>% 
    as.data.frame() %>% 
    mutate(ntree=1:n()) %>% 
    tibble::as_tibble() %>% 
    pivot_longer(-ntree, names_to="locale", values_to="rsq") %>%
    group_by(ntree) %>%
    summarize(meanrsq=mean(rsq)) %>%
    ggplot(aes(x=ntree, y=meanrsq)) + 
    geom_point() + 
    geom_text(aes(y=meanrsq+0.005, label=round(meanrsq, 3))) +
    ylim(c(NA, 1)) + 
    labs(x="# Trees", y="Mean R-squared by locale", title="Impact on R-squared of forest size")

The power of the random forest begins to appear, as RMSE is consistently lower by roughly ~0.5 degrees using the random forest regressions rather than the null means with a basic linear model. Importantly, the random forest regression is able to consider hour, dewpoint, and SLP together for a given locale-month, while the null model used a constant impact of dewpoint and SLP for every hour in the locale-month. Further, the random forest regression can potentially tease out non-linear patterns in the data.

Further, while already performing better, the random forest is still learning slightly at around 20 trees, suggesting that a larger forest might drive modest additional prediction improvements.
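The joint-consideration point can be illustrated with a small synthetic example (toy data only, not the METAR archive): when the true effect of dewpoint on temperature varies by hour, a model allowed to interact dewpoint with hour fits better than one forced to use a single constant dewpoint slope, which is the situation the random forest handles natively.

```r
# Toy sketch (synthetic data, not the METAR archive): the dewpoint slope is
# stronger in the afternoon, so an hour-by-dewpoint interaction beats a model
# with one constant dewpoint coefficient
set.seed(42)
n    <- 500
hour <- factor(sample(0:23, n, replace=TRUE))
dew  <- runif(n, 30, 70)

# True dewpoint effect: 0.9 during hours 12-17, 0.4 otherwise
slope <- ifelse(as.integer(as.character(hour)) %in% 12:17, 0.9, 0.4)
temp  <- 50 + slope * (dew - 50) + rnorm(n, sd=2)

rmse <- function(m) sqrt(mean(residuals(m)^2))
constantFit    <- lm(temp ~ dew + hour)   # single dewpoint slope for all hours
interactionFit <- lm(temp ~ dew * hour)   # hour-specific dewpoint slopes
c(constant=rmse(constantFit), interaction=rmse(interactionFit))
```

The interaction fit nests the constant fit, so its in-sample RMSE is necessarily at least as low; with hour-varying slopes in the data, the gap is material.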

The progression of RMSE by locale is plotted again:

mseFull2016Overall <- map_dfr(lstFull2016mhds, .f=function(x) { x$testData }) %>%
    summarize(err2=mean((TempF-predicted)**2)) %>%
    pull(err2)

mseFull2016mhds <- mseFull2016mhds %>%
    bind_rows(tibble::tibble(locNamefct="Overall", mse=mseFull2016Overall))

mseByLocale <- newByLocale %>% 
    inner_join(mseFull2016mhds, by=c("locNamefct")) %>%
    rename(rf=mse)

# Create plot for RMSE (the select() call only reorders columns so eDewSLP and rf come last)
mseByLocale %>%
    select(-eDewSLP, -rf, eDewSLP, rf) %>%
    pivot_longer(-locNamefct, names_to="byNull", values_to="MSE") %>%
    mutate(RMSE=MSE**0.5, 
           byNull=factor(byNull, levels=c("varTot", "eLoc2", "eLocMonth", "eLocMonthHour", "eDewSLP", "rf"))
           ) %>%
    group_by(locNamefct) %>% 
    mutate(dRMSE=ifelse(row_number()==n(), RMSE, RMSE-lead(RMSE))) %>%
    ungroup() %>%
    ggplot(aes(x=fct_reorder(locNamefct, RMSE, .fun=max), 
               y=dRMSE, 
               fill=byNull
               )
           ) + 
    geom_col(position="stack") + 
    coord_flip() + 
    scale_fill_discrete("Technique", 
                        breaks=c("varTot", "eLoc2", "eLocMonth", "eLocMonthHour", "eDewSLP", "rf"), 
                        labels=c("Impact of Locale", "Impact of Month", "Impact of Hour", 
                                 "Impact of Dew and SLP", "Impact of Forest", "Final"
                                 ), 
                        guide=guide_legend(reverse=TRUE)
                        ) + 
    labs(x="", y="RMSE", title="RMSE by Locale by Technique (averages, then lm for Dew/SLP, then forest)") + 
    theme(legend.position="bottom")

The modest but meaningful impact of moving to the random forest regression is apparent.

A function is written to integrate testData files from multiple regressions, categorize them, and plot the resulting RMSE. Functions helperGetRMSE() and plotRMSEEvolution() are in WeatherModelingFunctions_v001.R. The functions are then run to compare the categorical-only forest and the categorical-plus-numerical forest:
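Before running them, here is a rough sketch of the kind of logic such a helper encapsulates (hypothetical code; the actual helperGetRMSE() in WeatherModelingFunctions_v001.R may differ in signature and detail):

```r
# Hypothetical sketch only; the real helperGetRMSE() may differ. Given a named
# list of regression outputs, pull each testData and compute its RMSE.
sketchGetRMSE <- function(lst, depVar="TempF") {
    sapply(lst, function(x) {
        sqrt(mean((x$testData[[depVar]] - x$testData$predicted)^2))
    })
}
```

Applied to a list such as lstFull2016mhds, this would return a named vector of RMSE by locale.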

# Impact of adding dewpoint and SLP to the model
rmseCatvNum <- plotRMSEEvolution(list(catOnly=lstFull2016mhxx, catNum=lstFull2016mhds), 
                                 depVar="TempF", 
                                 mapper=c("catOnly"="Impact of Loc-Month-Hour", 
                                          "catNum"="Impact of Dew-SLP"
                                          )
                                 )

This should simplify RMSE comparisons, and the output file can also be used to calculate R-squared:

rmseCatvNum %>%
    group_by(locNamefct) %>%
    mutate(rsq=1-rmse**2/max(rmse)**2) %>%
    ggplot(aes(x=fct_reorder(locNamefct, rsq, .fun=max), y=rsq, color=mainList, group=mainList)) + 
    geom_point() + 
    coord_flip() + 
    labs(x="", y="R-squared", title="Evolution of R-squared by locale") + 
    scale_color_discrete("", 
                         breaks=c("totVar", "catOnly", "catNum"), 
                         labels=c("No Model", "Categorical Only", "Categorical and Numerical")
                         ) + 
    ylim(c(0, 1)) +
    theme(legend.position="bottom") + 
    geom_text(data=~filter(., mainList=="catNum"), 
              aes(y=rsq+0.01, label=round(rsq, 3)), 
              hjust=0, size=3.5, fontface="bold"
              )

Final predictions are less accurate (higher RMSE) in cold-weather locales, but the amount of variance explained (R-squared) is higher in cold weather locales due to the very high starting variance that is significantly reduced by the modeling.
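The arithmetic behind this is direct: R-squared compares model MSE to total variance, so the same (or even larger) RMSE yields a higher R-squared where the underlying spread is larger. A toy illustration with made-up numbers:

```r
# R-squared implied by an RMSE and the total standard deviation of the series
rsqFromRMSE <- function(rmse, sdTot) 1 - rmse^2 / sdTot^2

rsqFromRMSE(rmse=4, sdTot=20)  # high-variance cold-weather locale: 0.96
rsqFromRMSE(rmse=2, sdTot=5)   # low-variance coastal locale: 0.84
```

The first locale has double the RMSE yet the higher R-squared, purely because its underlying variance is so much larger.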

Next, random forest models, run separately by locale, are created for:

  • Locale-only
  • Locale, month
  • Locale, month, hour
  • Locale, month, hour, dewpoint
  • Locale, month, hour, dewpoint, SLP
  • Locale, month, hour, dewpoint, SLP, Altimeter
  • Locale, month, hour, dewpoint, SLP, Altimeter, wind speed, predominant wind direction, cloud heights
startTime <- Sys.time()

# Variable combinations for running random forests
varCombos <- list(cmb1=c("locNamefct"),
                  cmb2=c("locNamefct", "month"), 
                  cmb3=c("locNamefct", "month", "hrfct"), 
                  cmb4=c("locNamefct", "month", "hrfct", "DewF"), 
                  cmb5=c("locNamefct", "month", "hrfct", "DewF", "modSLP"), 
                  cmb6=c("locNamefct", "month", "hrfct", "DewF", "modSLP", "Altimeter"), 
                  cmb7=c("locNamefct", "month", "hrfct", "DewF", "modSLP", "Altimeter", 
                         "WindSpeed", "predomDir", "minHeight", "ceilingHeight"
                         ) 
                  )

# Variables to be included in all final testData sets
keepVars <- c("source", "dtime", "locNamefct", "month", "hrfct", "DewF", 
              "modSLP", "Altimeter", "WindSpeed", "predomDir", "minHeight", "ceilingHeight"
              )

# Create an overall container to hold the output
lstRFRegCombos <- vector("list", length(varCombos))

# Run through every combination of variables, and run every locale once inside that
nCombo <- 1
for (varCombo in varCombos) {
    
    cat("\nBeginning to process for variables:", varCombo)
    
    # Initialize a list container to be used inside this element of the main list
    lstRFRegCombos[[nCombo]] <- vector("list", length(locsFull2016))
    
    # Create mtry based on length(varCombo)
    mtry <- min(3, length(varCombo)) + max(0, ceiling((length(varCombo)-4)/3))
    cat("\nmtry will be set to:", mtry)
    
    # Create ntree based on length(varCombo)
    ntree <- max(5, min(50, ceiling(0.8 * length(varCombo)**2)))
    cat("\nntree will be set to:", ntree)
    
    # Run once for each locale
    nLoc <- 1
    for (loc in locsFull2016) {
        # Run a regression for TempF vs. varCombo
        lstRFRegCombos[[nCombo]][[nLoc]] <- rfRegression(regrUse2016, 
                                                         depVar="TempF", 
                                                         predVars=varCombo, 
                                                         otherVar=keepVars,
                                                         critFilter=list(year=2016, locNamefct=loc), 
                                                         seed=2007251330, 
                                                         ntree=ntree, 
                                                         mtry=mtry, 
                                                         testSize=0.3
                                                         )
    
        # Increment counter
        nLoc <- nLoc + 1
        
        # Announce progress
        if ((nLoc %% 5) == 0) cat("\nFinished processing for:", loc)
    }
    
    # Increment counter
    nCombo <- nCombo + 1
        
    # Update progress
    cat("\nFinished processing:", varCombo)
    
}
## 
## Beginning to process for variables: locNamefct
## mtry will be set to: 1
## ntree will be set to: 5
## Finished processing for: Dallas, TX
## Finished processing for: Houston, TX
## Finished processing for: Madison, WI
## Finished processing for: Newark, NJ
## Finished processing for: San Diego, CA
## Finished processing for: Traverse City, MI
## Finished processing: locNamefct
## Beginning to process for variables: locNamefct month
## mtry will be set to: 2
## ntree will be set to: 5
## Finished processing for: Dallas, TX
## Finished processing for: Houston, TX
## Finished processing for: Madison, WI
## Finished processing for: Newark, NJ
## Finished processing for: San Diego, CA
## Finished processing for: Traverse City, MI
## Finished processing: locNamefct month
## Beginning to process for variables: locNamefct month hrfct
## mtry will be set to: 3
## ntree will be set to: 8
## Finished processing for: Dallas, TX
## Finished processing for: Houston, TX
## Finished processing for: Madison, WI
## Finished processing for: Newark, NJ
## Finished processing for: San Diego, CA
## Finished processing for: Traverse City, MI
## Finished processing: locNamefct month hrfct
## Beginning to process for variables: locNamefct month hrfct DewF
## mtry will be set to: 3
## ntree will be set to: 13
## Finished processing for: Dallas, TX
## Finished processing for: Houston, TX
## Finished processing for: Madison, WI
## Finished processing for: Newark, NJ
## Finished processing for: San Diego, CA
## Finished processing for: Traverse City, MI
## Finished processing: locNamefct month hrfct DewF
## Beginning to process for variables: locNamefct month hrfct DewF modSLP
## mtry will be set to: 4
## ntree will be set to: 20
## Finished processing for: Dallas, TX
## Finished processing for: Houston, TX
## Finished processing for: Madison, WI
## Finished processing for: Newark, NJ
## Finished processing for: San Diego, CA
## Finished processing for: Traverse City, MI
## Finished processing: locNamefct month hrfct DewF modSLP
## Beginning to process for variables: locNamefct month hrfct DewF modSLP Altimeter
## mtry will be set to: 4
## ntree will be set to: 29
## Finished processing for: Dallas, TX
## Finished processing for: Houston, TX
## Finished processing for: Madison, WI
## Finished processing for: Newark, NJ
## Finished processing for: San Diego, CA
## Finished processing for: Traverse City, MI
## Finished processing: locNamefct month hrfct DewF modSLP Altimeter
## Beginning to process for variables: locNamefct month hrfct DewF modSLP Altimeter WindSpeed predomDir minHeight ceilingHeight
## mtry will be set to: 5
## ntree will be set to: 50
## Finished processing for: Dallas, TX
## Finished processing for: Houston, TX
## Finished processing for: Madison, WI
## Finished processing for: Newark, NJ
## Finished processing for: San Diego, CA
## Finished processing for: Traverse City, MI
## Finished processing: locNamefct month hrfct DewF modSLP Altimeter WindSpeed predomDir minHeight ceilingHeight
endTime <- Sys.time()
cat("\nTotal run time:", round(as.numeric(difftime(endTime, startTime, units="secs")), 1), "seconds\n")
## 
## Total run time: 607.8 seconds

Results can then be plotted by picking elements of the main list:

# Key list to run through
# Locale (1)
# Locale-Month-Hour (3)
# Locale-Month-Hour-Dew-SLP (5)
# Everything (7)

keyListPlot <- list(locOnly=lstRFRegCombos[[1]], 
                    locmh=lstRFRegCombos[[3]], 
                    locmhds=lstRFRegCombos[[5]], 
                    locevery=lstRFRegCombos[[7]]
                    )

# Impact of evolving the model
rmseCombos <- plotRMSEEvolution(keyListPlot, 
                                depVar="TempF", 
                                mapper=c("locOnly"="Impact of Locale", 
                                         "locmh"="Impact of Month-Hour", 
                                         "locmhds"="Impact of Dew-SLP",
                                         "locevery"="Impact of all other"
                                         )
                                )

# Plot the evolution of R-squared based on the RMSE data
rmseCombos %>%
    group_by(locNamefct) %>%
    mutate(rsq=1-rmse**2/max(rmse)**2) %>%
    ggplot(aes(x=fct_reorder(locNamefct, rsq, .fun=max), y=rsq, color=mainList, group=mainList)) + 
    geom_point() + 
    coord_flip() + 
    labs(x="", y="R-squared", title="Evolution of R-squared by locale") + 
    scale_color_discrete("", 
                         breaks=c("totVar", "locOnly", "locmh", "locmhds", "locevery"), 
                         labels=c("No Model", "Locale Only", "Locale-Month-Hour", 
                                  "Locale-Month-Hour-Dew-SLP", "Everything"
                                  )
                         ) + 
    ylim(c(0, 1)) +
    theme(legend.position="bottom") + 
    geom_text(data=~filter(., mainList=="locevery"), 
              aes(y=rsq+0.01, label=round(rsq, 3)), 
              hjust=0, size=3.5, fontface="bold"
              )

R-squared is generally in the 95%+ regime, with the exceptions of the Pacific coast, deserts, Denver, and Florida. Denver and the deserts stand out as interesting since they also have among the highest remaining RMSE. In contrast, the low R-squared for the Pacific coast and Florida is based in part on the low underlying temperature variations for these locales.

Next, the models are used to make predictions on data from the same locales but different years:

# Create the list of locales that have data in multiple years
oosLocs <- c("Chicago, IL", "Las Vegas, NV", "New Orleans, LA", "San Diego, CA")

# Pull a dataset that has relevant data
oosData <- regrData %>%
    filter(locNamefct %in% oosLocs, 
           !is.na(TempF)
           )

# Create a container to hold results by locale
oosList <- vector("list", length(oosLocs))
names(oosList) <- oosLocs

# Run for each locale
nLoc <- 1
for (loc in oosLocs) {
    
    # Filter to only data for this locale
    oosHere <- oosData %>%
        filter(locNamefct==loc)
    
    # Get the position for the locale variable
    locPosn <- match(loc, locsFull2016)
    cat("\nMaking predictions for:", loc, "which is in position:", locPosn)
    
    # Make an appropriate sized list for holding all the predictions
    preds <- vector("list", length(lstRFRegCombos))
    
    # Make predictions for each model
    nModel <- 1
    for (mdl in lstRFRegCombos) {
        # Pull the predictor names from the fitted model's importance matrix
        predNames <- mdl[[locPosn]]$rfModel$importance %>% row.names()
        preds[[nModel]] <- predict(mdl[[locPosn]]$rfModel, newdata=oosHere[, predNames])
        nModel <- nModel + 1
    }

    # Output the model data file with predictions
    oosList[[nLoc]] <- oosHere %>%
        bind_cols(map_dfc(preds, .f=function(x) x))
    
    nLoc <- nLoc + 1
    
}
## 
## Making predictions for: Chicago, IL which is in position: 3
## Making predictions for: Las Vegas, NV which is in position: 11
## Making predictions for: New Orleans, LA which is in position: 18
## Making predictions for: San Diego, CA which is in position: 24

Performance by year and model can be assessed:

# Hard code for the number of models
nModels <- 7

# Combine the datasets in oosList
tblPreds <- map_dfr(oosList, .f=function(x) x, .id="listName")

# Create the error for each scenario
for (intCtr in 1:nModels) {

    tblPreds <- tblPreds %>%
        mutate(err=get(paste0("V", intCtr))-TempF) %>%
        rename_at("err", ~paste0("err", intCtr))
    
}

# Create RMSE by locale and year
oosRMSE <- tblPreds %>%
    group_by(locNamefct, year) %>%
    summarize_at(vars(starts_with("err")), .funs=function(x) mean(x**2)**0.5) %>%
    pivot_longer(-c(locNamefct, year), names_to="model", values_to="rmse") %>%
    filter(model %in% c("err1", "err3", "err5", "err7")) %>%
    mutate(model=case_when(model=="err1" ~ "loc only", model=="err3" ~ "add mon/hr", 
                           model=="err5" ~ "add dew/SLP", model=="err7" ~ "add all other", 
                           TRUE~model
                           )
           ) 

# Create plot of RMSE by locale by year
oosRMSE %>%
    ggplot(aes(x=fct_reorder(model, -rmse), y=rmse, fill=factor(year))) + 
    geom_col(position="dodge") + 
    facet_wrap(~locNamefct) + 
    labs(title="RMSE evolution for models built using 2016 data", x="", y="RMSE") + 
    scale_fill_discrete("")

# Create base RMSE (the raw standard deviation of TempF) by locale and year
baseRMSE <- oosData %>% 
    group_by(locNamefct, year) %>% 
    summarize(basermse=sd(TempF))

# Combine with base RMSE data
oosRMSE %>%
    left_join(baseRMSE, by=c("locNamefct", "year")) %>%
    mutate(rsq=1-rmse**2/basermse**2) %>%
    ggplot(aes(x=factor(year), y=rsq, fill=fct_reorder(model, rsq))) + 
    geom_col(position="dodge") + 
    facet_wrap(~locNamefct) + 
    labs(title="R-squared evolution for models built using 2016 data", x="", y="R-squared") + 
    scale_fill_discrete("") + 
    coord_flip()

The 2016 models perform better on the 2016 data, though there continues to be reduction in RMSE with more complex models in every year (with the exception of Las Vegas). This suggests the model is partially learning features specific to the year 2016 and partially learning features specific to the locale. Creating models based on a larger sample of data (more years) might help generalize the findings.
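The by-year comparison above boils down to grouping prediction errors by year and taking the root mean squared error within each group; a base-R sketch on toy numbers (synthetic errors, not the model output):

```r
# Toy prediction errors for three years
toyErrs <- data.frame(year = rep(c(2015, 2016, 2017), each=4),
                      err  = c(3, -2, 1, 2,  1, -1, 1, -1,  2, -3, 2, 1))

# RMSE within each year
rmseByYear <- aggregate(err ~ year, data=toyErrs, FUN=function(e) sqrt(mean(e^2)))
rmseByYear
```

A year with systematically larger errors (here 2015 and 2017) stands out immediately against the year the model was built to fit.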

Next, models are run for the 2014-2019 data for the locales that have data availability (Chicago, IL; Las Vegas, NV; New Orleans, LA; San Diego, CA):

# Create list of locations
fullDataLocs <- c("Chicago, IL", "Las Vegas, NV", "New Orleans, LA", "San Diego, CA")

# Create a main list, one per locale
lstFullData <- vector("list", length(fullDataLocs))


# Create the predictor variables for the model and the variables to keep
depVarFull <- c('hrfct', 'DewF', 'modSLP', 'Altimeter', 'WindSpeed', 
                'predomDir', 'minHeight', 'ceilingHeight'
                )
keepVarFull <- c('source', 'dtime', 'locNamefct', 'year', 'month', 'hrfct', 'DewF', 'modSLP', 
                 'Altimeter', 'WindSpeed', 'predomDir', 'minHeight', 'ceilingHeight'
                 )


# Run the regressions by locale and month
nLoc <- 1
for (loc in fullDataLocs) {
    
    # Pull data for only this locale, and where TempF is not missing
    pullData <- regrData %>%
        filter(locNamefct==loc, !is.na(TempF))
    
    # Create the months to be run
    fullDataMonths <- pullData %>%
        count(month) %>%
        pull(month)
    
    # Create containers for each run
    lstFullData[[nLoc]] <- vector("list", length(fullDataMonths))
    
    # Run random forest regression for each month for the locale
    cat("\nBeginning to process:", loc)
    nMonth <- 1
    for (mon in fullDataMonths) {
        
        # Run the regression
        lstFullData[[nLoc]][[nMonth]] <- rfRegression(pullData, 
                                                      depVar="TempF", 
                                                      predVars=depVarFull, 
                                                      otherVar=keepVarFull,
                                                      critFilter=list(locNamefct=loc, month=mon), 
                                                      seed=2007271252, 
                                                      ntree=100, 
                                                      mtry=4, 
                                                      testSize=0.3
                                                      )
        
        # Increment the counter
        nMonth <- nMonth + 1
        cat("\nFinished month:", mon)
    }
    
    # Increment the counter
    nLoc <- nLoc + 1
    
}
## 
## Beginning to process: Chicago, IL
## Finished month: Jan
## Finished month: Feb
## Finished month: Mar
## Finished month: Apr
## Finished month: May
## Finished month: Jun
## Finished month: Jul
## Finished month: Aug
## Finished month: Sep
## Finished month: Oct
## Finished month: Nov
## Finished month: Dec
## Beginning to process: Las Vegas, NV
## Finished month: Jan
## Finished month: Feb
## Finished month: Mar
## Finished month: Apr
## Finished month: May
## Finished month: Jun
## Finished month: Jul
## Finished month: Aug
## Finished month: Sep
## Finished month: Oct
## Finished month: Nov
## Finished month: Dec
## Beginning to process: New Orleans, LA
## Finished month: Jan
## Finished month: Feb
## Finished month: Mar
## Finished month: Apr
## Finished month: May
## Finished month: Jun
## Finished month: Jul
## Finished month: Aug
## Finished month: Sep
## Finished month: Oct
## Finished month: Nov
## Finished month: Dec
## Beginning to process: San Diego, CA
## Finished month: Jan
## Finished month: Feb
## Finished month: Mar
## Finished month: Apr
## Finished month: May
## Finished month: Jun
## Finished month: Jul
## Finished month: Aug
## Finished month: Sep
## Finished month: Oct
## Finished month: Nov
## Finished month: Dec

The relevant ‘testData’ files can then be combined for an assessment of overall prediction accuracy:

# Helper function to extract testData from inner list
combineTestData <- function(lst, elem="testData") {
    map_dfr(lst, .f=function(x) x[[elem]])
}

# Combine all of the test data files
fullTestData <- map_dfr(lstFullData, .f=combineTestData) %>%
    mutate(err=predicted-TempF, 
           year=factor(year)
           )

# Helper function to create RMSE data
helperCreateRMSE <- function(df, byVar, depVar, errVar="err") {
    
    df %>%
        group_by_at(vars(all_of(byVar))) %>%
        summarize(varTot=var(get(depVar)), varModel=mean(get(errVar)**2)) %>%
        mutate(rmseTot=varTot**0.5, rmseModel=varModel**0.5, rsq=1-varModel/varTot)
    
}

# Create plot for a given by-variable and facet-variable
helperRMSEPlot <- function(df, byVar, depVar, facetVar=NULL) {

    # Create a copy of the original by variable
    byVarOrig <- byVar
    
    # Expand byVar to include facetVar if facetVar is not null
    if (!is.null(facetVar)) {
        byVar <- unique(c(byVar, facetVar))
    }
    
    # Create the stacked before/after RMSE plot
    p1 <- df %>%
        helperCreateRMSE(byVar=byVar, depVar=depVar) %>%
        select_at(vars(all_of(c(byVar, "rmseTot", "rmseModel")))) %>%
        pivot_longer(c(rmseTot, rmseModel), names_to="model", values_to="rmse") %>%
        group_by_at(vars(all_of(byVar))) %>%
        mutate(dRMSE=ifelse(row_number()==n(), rmse, rmse-lead(rmse)), 
               model=factor(model, levels=c("rmseTot", "rmseModel"))
               ) %>%
        ggplot(aes_string(x=byVarOrig, y="dRMSE", fill="model")) + 
        geom_col() + 
        geom_text(data=~filter(., model=="rmseModel"), aes(y=dRMSE/2, label=round(dRMSE, 1))) +
        coord_flip() + 
        labs(x="", y="RMSE", title="RMSE before and after modelling") + 
        scale_fill_discrete("", 
                            breaks=c("rmseModel", "rmseTot"), 
                            labels=c("Final", "Explained by Model")
                            ) + 
        theme(legend.position="bottom")
    # Add faceting if the argument was passed
    if (!is.null(facetVar)) { p1 <- p1 + facet_wrap(as.formula(paste("~", facetVar))) }
    print(p1)
    
}

# Stand-alone on three main dimensions
helperRMSEPlot(fullTestData, byVar="locNamefct", depVar="TempF")

helperRMSEPlot(fullTestData, byVar="year", depVar="TempF")

helperRMSEPlot(fullTestData, byVar="month", depVar="TempF")

# Faceted by locale
helperRMSEPlot(fullTestData, byVar="year", depVar="TempF", facetVar="locNamefct")

helperRMSEPlot(fullTestData, byVar="month", depVar="TempF", facetVar="locNamefct")

Further, an overall decline in MSE can be estimated as the average of the MSE declines in each locale-month:

# Function to extract MSE data from inner lists
helperMSETibble <- function(x) { 
    map_dfr(x, .f=function(y) tibble::tibble(ntree=1:length(y$mse), mse=y$mse)) 
}

map_dfr(lstFullData, .f=function(x) { helperMSETibble(x) }, .id="source") %>%
    group_by(source, ntree) %>%
    summarize(meanmse=mean(mse)) %>%
    ungroup() %>%
    mutate(source=fullDataLocs[as.integer(source)]) %>%
    ggplot(aes(x=ntree, y=meanmse, group=source, color=source)) + 
    geom_line() + 
    ylim(c(0, NA)) + 
    labs(x="# Trees", y="MSE", title="Evolution of Average MSE by Number of Trees")

At 100 trees, the model appears to have largely finished learning, with no further material declines in MSE. Overall, model predictions deviate from actual temperatures by 3-4 degrees on average. Deviations are larger in Las Vegas (4-5 degrees) and in spring in Chicago (4-5 degrees), and smaller in San Diego (2-3 degrees) and in winter in Chicago (2-3 degrees).
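"Largely finished learning" can be made concrete with a simple plateau check: the relative MSE improvement over the last stretch of trees falls below a small threshold (a sketch on a toy decaying MSE trace, with the 1% cutoff chosen arbitrarily):

```r
# Toy per-tree MSE trace decaying toward an irreducible floor of 12
mseTrace <- 30 * exp(-(1:100) / 15) + 12

# Relative improvement over the final 10 trees; below 1% suggests a plateau
improvement <- (mseTrace[90] - mseTrace[100]) / mseTrace[90]
improvement < 0.01
```

The same check could be run on the per-tree MSE vectors stored in each regression output to decide whether additional trees are worth the run time.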

The model is then run for all months combined for a single locale, to compare results when month is a trained explanatory variable rather than a segment modelled separately:

# Create list of locations
fullDataLocs <- c("Chicago, IL", "Las Vegas, NV", "New Orleans, LA", "San Diego, CA")

# Create a main list, one per locale
lstFullData_002 <- vector("list", length(fullDataLocs))


# Create the predictor variables for the model and the variables to keep
depVarFull_002 <- c('month', 'hrfct', 'DewF', 'modSLP', 
                    'Altimeter', 'WindSpeed', 'predomDir', 
                    'minHeight', 'ceilingHeight'
                    )
keepVarFull_002 <- c('source', 'dtime', 'locNamefct', 'year', 'month', 'hrfct', 
                     'DewF', 'modSLP', 'Altimeter', 'WindSpeed', 'predomDir', 
                     'minHeight', 'ceilingHeight'
                     )


# Run the regressions by locale (all months combined)
nLoc <- 1
for (loc in fullDataLocs) {
    
    # Pull data for only this locale, and where TempF is not missing
    pullData <- regrData %>%
        filter(locNamefct==loc, !is.na(TempF))
    
    # To be parallel with previous runs, make a length-one list inside locale
    lstFullData_002[[nLoc]] <- vector("list", 1)
    
    # Run random forest regression for each locale
    cat("\nBeginning to process:", loc)
    lstFullData_002[[nLoc]][[1]] <- rfRegression(pullData, 
                                                 depVar="TempF", 
                                                 predVars=depVarFull_002, 
                                                 otherVar=keepVarFull_002,
                                                 critFilter=list(locNamefct=loc), 
                                                 seed=2007281307, 
                                                 ntree=25, 
                                                 mtry=4, 
                                                 testSize=0.3
                                                 )
    
    # Increment the counter
    nLoc <- nLoc + 1
    
}
## 
## Beginning to process: Chicago, IL
## Beginning to process: Las Vegas, NV
## Beginning to process: New Orleans, LA
## Beginning to process: San Diego, CA

The results can then be compared to the results of the regressions run using month as a segment:

# Combine all of the test data files
fullTestData_002 <- map_dfr(lstFullData_002, .f=combineTestData) %>%
    mutate(err=predicted-TempF, 
           year=factor(year)
           )

# Stand-alone on three main dimensions
helperRMSEPlot(fullTestData_002, byVar="locNamefct", depVar="TempF")

helperRMSEPlot(fullTestData_002, byVar="year", depVar="TempF")

helperRMSEPlot(fullTestData_002, byVar="month", depVar="TempF")

# Faceted by locale
helperRMSEPlot(fullTestData_002, byVar="year", depVar="TempF", facetVar="locNamefct")

helperRMSEPlot(fullTestData_002, byVar="month", depVar="TempF", facetVar="locNamefct")

# Evolution of MSE
map_dfr(lstFullData_002, .f=function(x) { helperMSETibble(x) }, .id="source") %>%
    group_by(source, ntree) %>%
    summarize(meanmse=mean(mse)) %>%
    ungroup() %>%
    mutate(source=fullDataLocs[as.integer(source)]) %>%
    ggplot(aes(x=ntree, y=meanmse, group=source, color=source)) + 
    geom_line() + 
    ylim(c(0, NA)) + 
    labs(x="# Trees", y="MSE", title="Evolution of Average MSE by Number of Trees")

Prediction quality and the evolution of MSE with the number of trees look broadly similar to the results from the locale-month runs. Notably, month scores high on variable importance:

impList <- lapply(lstFullData_002, FUN=function(x) { 
    locName <- x[[1]]$testData$locNamefct %>% as.character() %>% unique()
    x[[1]]$rfModel$importance %>% 
        as.data.frame() %>%
        rownames_to_column("variable") %>%
        rename_at(vars(all_of("IncNodePurity")), ~locName) %>%
        tibble::as_tibble()
    }
    )

impDF <- Reduce(function(x, y) merge(x, y, all=TRUE), impList)

# Overall variable importance
impDF %>%
    pivot_longer(-variable, names_to="locale", values_to="incPurity") %>%
    ggplot(aes(x=fct_reorder(varMapper[variable], incPurity), y=incPurity)) + 
    geom_col() + 
    coord_flip() + 
    facet_wrap(~locale) + 
    labs(x="", y="Importance", title="Variable Importance by Locale")

# Relative variable importance
impDF %>%
    pivot_longer(-variable, names_to="locale", values_to="incPurity") %>%
    group_by(locale) %>%
    mutate(incPurity=incPurity/sum(incPurity)) %>%
    ggplot(aes(x=fct_reorder(varMapper[variable], incPurity), y=incPurity)) + 
    geom_col() + 
    coord_flip() + 
    facet_wrap(~locale) + 
    labs(x="", y="Relative Importance", title="Relative Variable Importance by Locale")

There is much more underlying variance in the Chicago data, which explains the greater overall variable importance for Chicago. On a relative basis, locale predictions are driven by:

  • Chicago - Dew Point, Month
  • Las Vegas - Month, Sea-Level Pressure, Hour, Altimeter
  • New Orleans - Dew Point, Month
  • San Diego - Month, Hour, Dew Point

It is interesting to see the similarities in Chicago and New Orleans, with both having strong explanatory power from the combination of dew point and month, despite meaningfully different climates. As in previous analyses, Las Vegas and San Diego look different from each other and also different from Chicago/New Orleans.

Finally, an attempt is made to model temperature using locale as an explanatory variable:

# Create list of locations
fullDataLocs <- c("Chicago, IL", "Las Vegas, NV", "New Orleans, LA", "San Diego, CA")

# Create a main list, one for overall
lstFullData_003 <- vector("list", 1)


# Create a list of relevant dependent variables and variables to keep
depVarFull_003 <- c('locNamefct', 'month', 'hrfct', 
                    'DewF', 'modSLP', 'Altimeter', 
                    'WindSpeed', 'predomDir', 'minHeight', 'ceilingHeight'
                    )
keepVarFull_003 <- c('source', 'dtime', 'locNamefct', 'year', 'month', 'hrfct', 
                     'DewF', 'modSLP', 'Altimeter', 'WindSpeed', 'predomDir', 
                     'minHeight', 'ceilingHeight'
                     )


# Run the overall regressions
# Pull data for only the specified locales, and where TempF is not missing
pullData <- regrData %>%
    filter(locNamefct %in% fullDataLocs, !is.na(TempF))
    
# To be parallel with previous runs, make a length-one list inside locale
lstFullData_003[[1]] <- vector("list", 1)

startTime <- Sys.time()

# Run random forest regression (overall only)
lstFullData_003[[1]][[1]] <- rfRegression(pullData, 
                                          depVar="TempF", 
                                          predVars=depVarFull_003, 
                                          otherVar=keepVarFull_003,
                                          critFilter=list(locNamefct=fullDataLocs), 
                                          seed=2007291317, 
                                          ntree=2, 
                                          mtry=4, 
                                          testSize=0.3
                                          )

endTime <- Sys.time()
cat("\nTotal run time:", round(as.numeric(difftime(endTime, startTime, units="secs")), 1), "seconds\n")
## 
## Total run time: 1121.1 seconds

The results can then be compared to the results of the regressions run using locale as a segment:

# Combine all of the test data files
fullTestData_003 <- map_dfr(lstFullData_003, .f=combineTestData) %>%
    mutate(err=predicted-TempF, 
           year=factor(year)
           )

# Stand-alone on three main dimensions
helperRMSEPlot(fullTestData_003, byVar="locNamefct", depVar="TempF")

helperRMSEPlot(fullTestData_003, byVar="year", depVar="TempF")

helperRMSEPlot(fullTestData_003, byVar="month", depVar="TempF")

# Facetted by locale
helperRMSEPlot(fullTestData_003, byVar="year", depVar="TempF", facetVar="locNamefct")

helperRMSEPlot(fullTestData_003, byVar="month", depVar="TempF", facetVar="locNamefct")

# Evolution of MSE
map_dfr(lstFullData_003, .f=function(x) { helperMSETibble(x) }, .id="source") %>%
    group_by(source, ntree) %>%
    summarize(meanmse=mean(mse)) %>%
    ungroup() %>%
    mutate(source="Overall") %>%
    ggplot(aes(x=ntree, y=meanmse, group=source, color=source)) + 
    geom_line() + 
    ylim(c(0, NA)) + 
    labs(x="# Trees", y="MSE", title="Evolution of Average MSE by Number of Trees")

Run time is dramatically longer when locale is used as an explanatory variable in a single overall model than when separate models are run by locale. Splitting the models by locale and/or month may complicate summary statistics and carries some risk of over-fitting, but those risks may be more than offset by the much lower run times.

What if the model is run on a very small subset of the data (e.g., switch test data from 30% of the sample to 90% of the sample)?

# Create list of locations
fullDataLocs <- c("Chicago, IL", "Las Vegas, NV", "New Orleans, LA", "San Diego, CA")

# Create a main list, one for overall
lstFullData_004 <- vector("list", 1)


# Create a list of relevant dependent variables and variables to keep
depVarFull_004 <- c('locNamefct', 'month', 'hrfct', 
                    'DewF', 'modSLP', 'Altimeter', 
                    'WindSpeed', 'predomDir', 'minHeight', 'ceilingHeight'
                    )
keepVarFull_004 <- c('source', 'dtime', 'locNamefct', 'year', 'month', 'hrfct', 
                     'DewF', 'modSLP', 'Altimeter', 'WindSpeed', 'predomDir', 
                     'minHeight', 'ceilingHeight'
                     )


# Run the overall regressions
# Pull data for only the specified locales, and where TempF is not missing
pullData <- regrData %>%
    filter(locNamefct %in% fullDataLocs, !is.na(TempF))
    
# To be parallel with previous runs, make a length-one list inside locale
lstFullData_004[[1]] <- vector("list", 1)

startTime <- Sys.time()

# Run random forest regression (overall only)
lstFullData_004[[1]][[1]] <- rfRegression(pullData, 
                                          depVar="TempF", 
                                          predVars=depVarFull_004, 
                                          otherVar=keepVarFull_004,
                                          critFilter=list(locNamefct=fullDataLocs), 
                                          seed=2007291412, 
                                          ntree=100, 
                                          mtry=4, 
                                          testSize=0.9
                                          )

endTime <- Sys.time()
cat("\nTotal run time:", round(as.numeric(difftime(endTime, startTime, units="secs")), 1), "seconds\n")
## 
## Total run time: 338 seconds

The results can then be compared to the results of the regressions run using locale as a segment:

# Combine all of the test data files
fullTestData_004 <- map_dfr(lstFullData_004, .f=combineTestData) %>%
    mutate(err=predicted-TempF, 
           year=factor(year)
           )

# Stand-alone on three main dimensions
helperRMSEPlot(fullTestData_004, byVar="locNamefct", depVar="TempF")

helperRMSEPlot(fullTestData_004, byVar="year", depVar="TempF")

helperRMSEPlot(fullTestData_004, byVar="month", depVar="TempF")

# Facetted by locale
helperRMSEPlot(fullTestData_004, byVar="year", depVar="TempF", facetVar="locNamefct")

helperRMSEPlot(fullTestData_004, byVar="month", depVar="TempF", facetVar="locNamefct")

# Evolution of MSE
map_dfr(lstFullData_004, .f=function(x) { helperMSETibble(x) }, .id="source") %>%
    group_by(source, ntree) %>%
    summarize(meanmse=mean(mse)) %>%
    ungroup() %>%
    mutate(source="Overall") %>%
    ggplot(aes(x=ntree, y=meanmse, group=source, color=source)) + 
    geom_line() + 
    ylim(c(0, NA)) + 
    labs(x="# Trees", y="MSE", title="Evolution of Average MSE by Number of Trees")

Run time is dramatically faster: 100 trees grown on the 10% sample take less time than 2 trees grown on the 70% sample. The size of the training dataset may be driving run-time challenges more than the complexity of the model.
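
A rough back-of-the-envelope sketch (not a rigorous cost model) makes the point concrete. The proxy below counts only total training rows touched across all trees, using the timings reported above:

```r
# Back-of-envelope cost proxy: total training rows touched across all trees,
# i.e., ntree * trainFrac (ignoring per-tree depth and constant factors)
rowPasses <- function(ntree, trainFrac) ntree * trainFrac

rowPasses(2, 0.7)    # 2-tree model trained on 70% of the data: 1.4 units
rowPasses(100, 0.1)  # 100-tree model trained on 10% of the data: 10 units

# The 100-tree run touches roughly 7x more rows yet finished roughly 3x
# faster (338 s vs 1121 s), so per-tree cost must grow superlinearly in
# the training sample size
rowPasses(100, 0.1) / rowPasses(2, 0.7)
```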

There is evidence that the smaller training dataset hurts model predictions. Compared to previous models trained on 70% of the data, the model using only 10% of the data appears to be about 1 degree less accurate on unseen data.

The test data size is decreased to 80% to check the impact:

# Create list of locations
fullDataLocs <- c("Chicago, IL", "Las Vegas, NV", "New Orleans, LA", "San Diego, CA")

# Create a main list, one for overall
lstFullData_005 <- vector("list", 1)


# Create a list of relevant dependent variables and variables to keep
depVarFull_005 <- c('locNamefct', 'month', 'hrfct', 
                    'DewF', 'modSLP', 'Altimeter', 
                    'WindSpeed', 'predomDir', 'minHeight', 'ceilingHeight'
                    )
keepVarFull_005 <- c('source', 'dtime', 'locNamefct', 'year', 'month', 'hrfct', 
                     'DewF', 'modSLP', 'Altimeter', 'WindSpeed', 'predomDir', 
                     'minHeight', 'ceilingHeight'
                     )


# Run the overall regressions
# Pull data for only the specified locales, and where TempF is not missing
pullData <- regrData %>%
    filter(locNamefct %in% fullDataLocs, !is.na(TempF))
    
# To be parallel with previous runs, make a length-one list inside locale
lstFullData_005[[1]] <- vector("list", 1)

startTime <- Sys.time()

# Run random forest regression (overall only)
lstFullData_005[[1]][[1]] <- rfRegression(pullData, 
                                          depVar="TempF", 
                                          predVars=depVarFull_005, 
                                          otherVar=keepVarFull_005,
                                          critFilter=list(locNamefct=fullDataLocs), 
                                          seed=2007291425, 
                                          ntree=25, 
                                          mtry=4, 
                                          testSize=0.8
                                          )

endTime <- Sys.time()
cat("\nTotal run time:", round(as.numeric(difftime(endTime, startTime, units="secs")), 1), "seconds\n")
## 
## Total run time: 925.8 seconds

The results can then be compared to the results of the regressions run using locale as a segment:

# Combine all of the test data files
fullTestData_005 <- map_dfr(lstFullData_005, .f=combineTestData) %>%
    mutate(err=predicted-TempF, 
           year=factor(year)
           )

# Stand-alone on three main dimensions
helperRMSEPlot(fullTestData_005, byVar="locNamefct", depVar="TempF")

helperRMSEPlot(fullTestData_005, byVar="year", depVar="TempF")

helperRMSEPlot(fullTestData_005, byVar="month", depVar="TempF")

# Facetted by locale
helperRMSEPlot(fullTestData_005, byVar="year", depVar="TempF", facetVar="locNamefct")

helperRMSEPlot(fullTestData_005, byVar="month", depVar="TempF", facetVar="locNamefct")

# Evolution of MSE
map_dfr(lstFullData_005, .f=function(x) { helperMSETibble(x) }, .id="source") %>%
    group_by(source, ntree) %>%
    summarize(meanmse=mean(mse)) %>%
    ungroup() %>%
    mutate(source="Overall") %>%
    ggplot(aes(x=ntree, y=meanmse, group=source, color=source)) + 
    geom_line() + 
    ylim(c(0, NA)) + 
    labs(x="# Trees", y="MSE", title="Evolution of Average MSE by Number of Trees")

This model splits the difference, being about 0.5 degrees worse than the 70%-train model run by locale and about 0.5 degrees better than the 10%-train model run overall. This illustrates the blend of art and science involved in tuning random forest models successfully.

Next, simulated data are prepared to investigate the impact of random forest hyperparameters on the quality of the output. A simple formula is used, y = x1*x2 + x3, where x1 and x2 are drawn from Poisson distributions and x3 from a standard normal distribution. Three extraneous variables are also added: x4 (normal, centered at -x1 and therefore negatively correlated with x1), x5 (pure normal noise), and x6 (a random three-level factor):

nObs <- 5000
set.seed(2007301252)

rfRnd <- tibble::tibble(x1=rpois(nObs, lambda=4), 
                        x2=rpois(nObs, lambda=6), 
                        x3=rnorm(nObs), 
                        x4=rnorm(nObs, mean=-x1), 
                        x5=rnorm(nObs, sd=3), 
                        x6=factor(sample(letters[1:3], nObs, replace=TRUE), levels=letters[1:3]), 
                        y=x1*x2 + x3
                        )

# Summary statistics
summary(rfRnd)
##        x1               x2               x3                  x4         
##  Min.   : 0.000   Min.   : 0.000   Min.   :-3.361358   Min.   :-14.067  
##  1st Qu.: 3.000   1st Qu.: 4.000   1st Qu.:-0.662320   1st Qu.: -5.441  
##  Median : 4.000   Median : 6.000   Median :-0.016804   Median : -3.875  
##  Mean   : 3.976   Mean   : 5.975   Mean   :-0.006134   Mean   : -3.985  
##  3rd Qu.: 5.000   3rd Qu.: 7.000   3rd Qu.: 0.648586   3rd Qu.: -2.377  
##  Max.   :13.000   Max.   :17.000   Max.   : 3.578669   Max.   :  2.197  
##        x5            x6             y          
##  Min.   :-10.33459   a:1709   Min.   : -2.366  
##  1st Qu.: -2.05727   b:1719   1st Qu.: 11.772  
##  Median : -0.01076   c:1572   Median : 20.357  
##  Mean   : -0.03043            Mean   : 23.770  
##  3rd Qu.:  1.95468            3rd Qu.: 31.888  
##  Max.   : 11.66095            Max.   :117.253
# Correlations
rfRnd %>%
    select(-x6) %>%
    cor() %>%
    round(2)
##       x1    x2    x3    x4    x5     y
## x1  1.00  0.00  0.01 -0.90  0.01  0.74
## x2  0.00  1.00  0.00  0.00 -0.02  0.60
## x3  0.01  0.00  1.00  0.00 -0.02  0.07
## x4 -0.90  0.00  0.00  1.00 -0.02 -0.66
## x5  0.01 -0.02 -0.02 -0.02  1.00 -0.01
## y   0.74  0.60  0.07 -0.66 -0.01  1.00
# Plot of y vs x1/x2
rfRnd %>%
    group_by(x1, x2) %>%
    summarize(n=n(), y=mean(y)) %>%
    ggplot(aes(x=x1, y=x2)) + 
    geom_tile(aes(fill=y)) + 
    geom_text(aes(label=paste0("n=", n))) + 
    scale_fill_continuous(low="white", high="blue") + 
    labs(title="Randomly generated data where y=x1*x2 + x3")
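
As a sanity check, the summary statistics above can be compared against the theoretical moments implied by the generating formula (x1, x2, and x3 are independent by construction):

```r
# Theoretical moments of y = x1*x2 + x3 with independent
# x1 ~ Pois(4), x2 ~ Pois(6), x3 ~ N(0, 1)
l1 <- 4; l2 <- 6

meanY <- l1 * l2                                      # E[x1]E[x2] + E[x3] = 24
varY  <- (l1^2 + l1) * (l2^2 + l2) - (l1 * l2)^2 + 1  # E[x1^2]E[x2^2] - (E[x1]E[x2])^2 + Var(x3) = 265

meanY        # 24 -- close to the observed mean of y (23.77)
sqrt(varY)   # ~16.3 -- consistent with the observed spread of y
```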

The simulated data appears to be as expected. Train and test datasets are created using a 70/30 split:

trnIdx <- sample(1:nrow(rfRnd), replace=FALSE, size=round(0.7*nrow(rfRnd)))
rfRndTrain <- rfRnd[trnIdx, ]
rfRndTest <- rfRnd[-trnIdx, ]

First, a linear model is run using the ‘known’ formula:

lmRnd <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x1:x2, data=rfRndTrain)
summary(lmRnd)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x1:x2, data = rfRndTrain)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -3.568e-12 -8.600e-15 -2.300e-15  1.800e-15  1.193e-11 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)  1.091e-13  2.306e-14  4.729e+00 2.34e-06 ***
## x1          -1.095e-14  6.272e-15 -1.746e+00    0.081 .  
## x2           3.504e-14  3.474e-15  1.009e+01  < 2e-16 ***
## x3           1.000e+00  3.740e-15  2.674e+14  < 2e-16 ***
## x4           6.315e-16  3.756e-15  1.680e-01    0.866    
## x5          -9.869e-16  1.236e-15 -7.980e-01    0.425    
## x6b          9.086e-16  8.958e-15  1.010e-01    0.919    
## x6c          1.148e-14  9.190e-15  1.250e+00    0.212    
## x1:x2        1.000e+00  7.841e-16  1.275e+15  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.2e-13 on 3491 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.364e+30 on 8 and 3491 DF,  p-value: < 2.2e-16
rfRndOut <- rfRndTest %>%
    mutate(predLM=predict(lmRnd, newdata=.), 
           errLM=predLM-y
           )

# Show error variance and R-squared by x6
rfRndOut %>%
    group_by(x6) %>%
    summarize(baseVar=var(y), lmVar=mean(errLM**2)) %>%
    mutate(lmR2=1-lmVar/baseVar)
## # A tibble: 3 x 4
##   x6    baseVar    lmVar  lmR2
##   <fct>   <dbl>    <dbl> <dbl>
## 1 a        321. 4.70e-26     1
## 2 b        216. 4.71e-26     1
## 3 c        246. 5.33e-26     1

As expected, the linear model using the known formula has 100% explanatory power on both the test and the training datasets.

Next, a random forest regression is run using all the default settings:

# Run the random forest model
rfRndModel <- randomForest::randomForest(y=rfRndTrain$y, 
                                         x=rfRndTrain[, c("x1", "x2", "x3", "x4", "x5", "x6")]
                                         )

# Assess accuracy on training dataset
rfRndModel
## 
## Call:
##  randomForest(x = rfRndTrain[, c("x1", "x2", "x3", "x4", "x5",      "x6")], y = rfRndTrain$y) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 5.701355
##                     % Var explained: 97.82
# Plot evolution of R-squared
tibble::tibble(rsq=rfRndModel$rsq, ntree=1:length(rsq)) %>%
    ggplot(aes(x=ntree, y=rsq)) + 
    geom_point() + 
    ylim(c(NA, 1)) + 
    labs(x="# trees", y="R-squared", title="Default parameters, all variables included")

# Plot variable importance
rfRndModel$importance %>% 
    as.data.frame() %>% 
    rownames_to_column("variable") %>% 
    ggplot(aes(x=fct_reorder(variable, IncNodePurity), y=IncNodePurity)) + 
    geom_col(fill="lightblue") + 
    coord_flip() + 
    labs(x="", title="Default parameters, all variables included")

# Assess performance on test data
rfRndOut <- rfRndOut %>%
    mutate(predRF=predict(rfRndModel, newdata=.), 
           errRF=predRF-y
           )

# Show error variance and R-squared by x6
rfRndOut %>%
    group_by(x6) %>%
    summarize(baseVar=var(y), rfVar=mean(errRF**2)) %>%
    mutate(rfR2=1-rfVar/baseVar)
## # A tibble: 3 x 4
##   x6    baseVar rfVar  rfR2
##   <fct>   <dbl> <dbl> <dbl>
## 1 a        321.  7.44 0.977
## 2 b        216.  3.20 0.985
## 3 c        246.  6.10 0.975

The random forest performs well, though a few limitations stand out:

  • Despite the formula being y=x1*x2 + x3, variable importance ranks x1, x2, and x4 as the top variables, which can mislead interpretation even though predictions are generally accurate
  • The model never reaches 100% R-squared, even on a fairly simple deterministic formula

Part of the issue is that three of the six variables are unnecessary while mtry is set to 2. As a result, a meaningful share of splits considers no relevant variable at all, and another share uses the negatively correlated noise variable (x4) in place of x1 simply because x1 is not among the sampled candidates.
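
The arithmetic behind this can be checked directly. With mtry = 2 candidates sampled from the six predictors, the relevant probabilities are:

```r
# mtry = 2 candidates drawn from 6 predictors; only x1, x2, x3 are informative
pNoInformative <- choose(3, 2) / choose(6, 2)  # both candidates uninformative: 3/15 = 0.2

# Probability a split sees the proxy x4 but not x1, so x4 can stand in for x1:
# x4 paired with one of x2, x3, x5, x6 -> 4 of the 15 possible pairs
pProxy <- 4 / choose(6, 2)

pNoInformative  # 0.2
pProxy          # ~0.267
```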

Suppose mtry is increased to 4:

# Run the random forest model
rfRndModel <- randomForest::randomForest(y=rfRndTrain$y, 
                                         x=rfRndTrain[, c("x1", "x2", "x3", "x4", "x5", "x6")], 
                                         mtry=4
                                         )

# Assess accuracy on training dataset
rfRndModel
## 
## Call:
##  randomForest(x = rfRndTrain[, c("x1", "x2", "x3", "x4", "x5",      "x6")], y = rfRndTrain$y, mtry = 4) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 1.294699
##                     % Var explained: 99.51
# Plot evolution of R-squared
tibble::tibble(rsq=rfRndModel$rsq, ntree=1:length(rsq)) %>%
    ggplot(aes(x=ntree, y=rsq)) + 
    geom_point() + 
    ylim(c(NA, 1)) + 
    labs(x="# trees", y="R-squared", title="mtry=4, all variables included")

# Plot variable importance
rfRndModel$importance %>% 
    as.data.frame() %>% 
    rownames_to_column("variable") %>% 
    ggplot(aes(x=fct_reorder(variable, IncNodePurity), y=IncNodePurity)) + 
    geom_col(fill="lightblue") + 
    coord_flip() + 
    labs(x="", title="mtry=4, all variables included")

# Assess performance on test data
rfRndOut <- rfRndOut %>%
    mutate(predRF=predict(rfRndModel, newdata=.), 
           errRF=predRF-y
           )

# Show error variance and R-squared by x6
rfRndOut %>%
    group_by(x6) %>%
    summarize(baseVar=var(y), rfVar=mean(errRF**2)) %>%
    mutate(rfR2=1-rfVar/baseVar)
## # A tibble: 3 x 4
##   x6    baseVar rfVar  rfR2
##   <fct>   <dbl> <dbl> <dbl>
## 1 a        321. 2.21  0.993
## 2 b        216. 0.450 0.998
## 3 c        246. 1.55  0.994

R-squared improves significantly to ~99.5%. The model correctly puts no importance on x5 and x6 but still puts high importance on x4, which is known to be unnecessary.

Suppose that the random forest trains only on the three variables known to be important, and is allowed to use all features at every split while drilling down to a nodesize of 1:

# Run the random forest model
rfRndModel <- randomForest::randomForest(y=rfRndTrain$y, 
                                         x=rfRndTrain[, c("x1", "x2", "x3")], 
                                         mtry=3, 
                                         nodesize=1
                                         )

# Assess accuracy on training dataset
rfRndModel
## 
## Call:
##  randomForest(x = rfRndTrain[, c("x1", "x2", "x3")], y = rfRndTrain$y,      mtry = 3, nodesize = 1) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 0.5997398
##                     % Var explained: 99.77
# Plot evolution of R-squared
tibble::tibble(rsq=rfRndModel$rsq, ntree=1:length(rsq)) %>%
    ggplot(aes(x=ntree, y=rsq)) + 
    geom_point() + 
    ylim(c(NA, 1)) + 
    labs(x="# trees", y="R-squared", title="mtry=3, nodesize=1, only includes x1, x2, x3")

# Plot variable importance
rfRndModel$importance %>% 
    as.data.frame() %>% 
    rownames_to_column("variable") %>% 
    ggplot(aes(x=fct_reorder(variable, IncNodePurity), y=IncNodePurity)) + 
    geom_col(fill="lightblue") + 
    coord_flip() + 
    labs(x="", title="mtry=3, nodesize=1, only includes x1, x2, x3")

# Assess performance on test data
rfRndOut <- rfRndOut %>%
    mutate(predRF=predict(rfRndModel, newdata=.), 
           errRF=predRF-y
           )

# Show error variance and R-squared by x6
rfRndOut %>%
    group_by(x6) %>%
    summarize(baseVar=var(y), rfVar=mean(errRF**2)) %>%
    mutate(rfR2=1-rfVar/baseVar)
## # A tibble: 3 x 4
##   x6    baseVar rfVar  rfR2
##   <fct>   <dbl> <dbl> <dbl>
## 1 a        321. 0.634 0.998
## 2 b        216. 0.234 0.999
## 3 c        246. 0.553 0.998

Performance remains strong, with test R-squared of ~99.8% and the overwhelming majority of the importance placed on x1 and x2.

The process is wrapped in a helper function so that it is easier to rerun:

# Helper function for running and reporting on random forest models
helperRFRandom <- function(dfData=rfRndTrain, 
                           yVar="y", 
                           xVar=c("x1", "x2", "x3", "x4", "x5", "x6"), 
                           ntree=500, 
                           mtry=max(1, floor(length(xVar)/3)), 
                           nodesize=5, 
                           dfErr=rfRndOut, 
                           title=NULL
                           ) {

    # FUNCTION ARGUMENTS    
    # dfData: tibble or data frame containing input data for model
    # yVar: y variable for modeling
    # xVar: x variable(s) for modeling
    # ntree: number of trees to run
    # mtry: Number of variables to consider at each split
    # nodesize: size of terminal nodes
    # dfErr: tibble or data frame containing test data and error estimates
    # title: title for the plots
    
    # Run the random forest model
    rfRndModel <- randomForest::randomForest(y=dfData[, yVar, drop=TRUE], 
                                             x=dfData[, xVar], 
                                             ntree=ntree,
                                             mtry=mtry, 
                                             nodesize=nodesize
                                             )

    # Assess accuracy on training dataset
    print(rfRndModel)

    # Plot evolution of R-squared
    p1 <- tibble::tibble(rsq=rfRndModel$rsq, ntree=1:length(rsq)) %>%
        ggplot(aes(x=ntree, y=rsq)) + 
        geom_point() + 
        ylim(c(NA, 1)) + 
        labs(x="# trees", y="R-squared", title=title)
    print(p1)

    # Plot variable importance
    p2 <- rfRndModel$importance %>% 
        as.data.frame() %>% 
        rownames_to_column("variable") %>% 
        ggplot(aes(x=fct_reorder(variable, IncNodePurity), y=IncNodePurity)) + 
        geom_col(fill="lightblue") + 
        coord_flip() + 
        labs(x="", title=title)
    print(p2)

    # Assess performance on test data
    dfErr <- dfErr %>%
        mutate(predRF=predict(rfRndModel, newdata=.), 
               errRF=predRF-y
               )

    # Show error variance and R-squared by x6
    dfErr %>%
        group_by(x6) %>%
        summarize(baseVar=var(y), rfVar=mean(errRF**2)) %>%
        mutate(rfR2=1-rfVar/baseVar) %>%
        print()
    
}


helperRFRandom(rfRndTrain, yVar="y", xVar=c("x1", "x2", "x3"), 
               mtry=3, nodesize=1, ntree=100, title="y ~ f(x1, x2, x3); mtry=3, nodesize=1; ntree=100"
               )
## 
## Call:
##  randomForest(x = dfData[, xVar], y = dfData[, yVar, drop = TRUE],      ntree = ntree, mtry = mtry, nodesize = nodesize) 
##                Type of random forest: regression
##                      Number of trees: 100
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 0.6119342
##                     % Var explained: 99.77

## # A tibble: 3 x 4
##   x6    baseVar rfVar  rfR2
##   <fct>   <dbl> <dbl> <dbl>
## 1 a        321. 0.598 0.998
## 2 b        216. 0.232 0.999
## 3 c        246. 0.507 0.998
helperRFRandom(rfRndTrain, yVar="y", xVar=c("x1", "x2", "x3", "x4", "x5", "x6"), 
               mtry=6, nodesize=1, ntree=100, 
               title="y ~ f(x1, x2, x3, x4, x5, x6); mtry=6, nodesize=1; ntree=100"
               )
## 
## Call:
##  randomForest(x = dfData[, xVar], y = dfData[, yVar, drop = TRUE],      ntree = ntree, mtry = mtry, nodesize = nodesize) 
##                Type of random forest: regression
##                      Number of trees: 100
## No. of variables tried at each split: 6
## 
##           Mean of squared residuals: 0.8040267
##                     % Var explained: 99.69

## # A tibble: 3 x 4
##   x6    baseVar rfVar  rfR2
##   <fct>   <dbl> <dbl> <dbl>
## 1 a        321. 1.11  0.997
## 2 b        216. 0.315 0.999
## 3 c        246. 0.653 0.997

Allowing all variables to be used at every split reduces x4 to near-zero importance, as expected: whenever x1 and x4 are both available at a split, x1 should essentially always be the better choice.

Suppose that the model is run once excluding each of the variables:

# Exclude x1
helperRFRandom(rfRndTrain, yVar="y", xVar=c("x2", "x3", "x4", "x5", "x6"), 
               mtry=5, nodesize=1, ntree=250, 
               title="y ~ f(exclude x1); mtry=5, nodesize=1; ntree=250"
               )
## 
## Call:
##  randomForest(x = dfData[, xVar], y = dfData[, yVar, drop = TRUE],      ntree = ntree, mtry = mtry, nodesize = nodesize) 
##                Type of random forest: regression
##                      Number of trees: 250
## No. of variables tried at each split: 5
## 
##           Mean of squared residuals: 40.15434
##                     % Var explained: 84.65

## # A tibble: 3 x 4
##   x6    baseVar rfVar  rfR2
##   <fct>   <dbl> <dbl> <dbl>
## 1 a        321.  43.6 0.864
## 2 b        216.  34.8 0.839
## 3 c        246.  37.5 0.848
# Exclude x2
helperRFRandom(rfRndTrain, yVar="y", xVar=c("x1", "x3", "x4", "x5", "x6"), 
               mtry=5, nodesize=1, ntree=250, 
               title="y ~ f(exclude x2); mtry=5, nodesize=1; ntree=250"
               )
## 
## Call:
##  randomForest(x = dfData[, xVar], y = dfData[, yVar, drop = TRUE],      ntree = ntree, mtry = mtry, nodesize = nodesize) 
##                Type of random forest: regression
##                      Number of trees: 250
## No. of variables tried at each split: 5
## 
##           Mean of squared residuals: 134.2149
##                     % Var explained: 48.7

## # A tibble: 3 x 4
##   x6    baseVar rfVar  rfR2
##   <fct>   <dbl> <dbl> <dbl>
## 1 a        321.  144. 0.551
## 2 b        216.  116. 0.462
## 3 c        246.  124. 0.496
# Exclude x3
helperRFRandom(rfRndTrain, yVar="y", xVar=c("x1", "x2", "x4", "x5", "x6"), 
               mtry=5, nodesize=1, ntree=250, 
               title="y ~ f(exclude x3); mtry=5, nodesize=1; ntree=250"
               )
## 
## Call:
##  randomForest(x = dfData[, xVar], y = dfData[, yVar, drop = TRUE],      ntree = ntree, mtry = mtry, nodesize = nodesize) 
##                Type of random forest: regression
##                      Number of trees: 250
## No. of variables tried at each split: 5
## 
##           Mean of squared residuals: 1.933025
##                     % Var explained: 99.26

## # A tibble: 3 x 4
##   x6    baseVar rfVar  rfR2
##   <fct>   <dbl> <dbl> <dbl>
## 1 a        321.  2.32 0.993
## 2 b        216.  1.19 0.994
## 3 c        246.  1.76 0.993
# Exclude x4
helperRFRandom(rfRndTrain, yVar="y", xVar=c("x1", "x2", "x3", "x5", "x6"), 
               mtry=5, nodesize=1, ntree=250, 
               title="y ~ f(exclude x4); mtry=5, nodesize=1; ntree=250"
               )
## 
## Call:
##  randomForest(x = dfData[, xVar], y = dfData[, yVar, drop = TRUE],      ntree = ntree, mtry = mtry, nodesize = nodesize) 
##                Type of random forest: regression
##                      Number of trees: 250
## No. of variables tried at each split: 5
## 
##           Mean of squared residuals: 0.7902814
##                     % Var explained: 99.7

## # A tibble: 3 x 4
##   x6    baseVar rfVar  rfR2
##   <fct>   <dbl> <dbl> <dbl>
## 1 a        321. 0.878 0.997
## 2 b        216. 0.320 0.999
## 3 c        246. 0.557 0.998
# Exclude x5
helperRFRandom(rfRndTrain, yVar="y", xVar=c("x1", "x2", "x3", "x4", "x6"), 
               mtry=5, nodesize=1, ntree=250,
               title="y ~ f(exclude x5); mtry=5, nodesize=1; ntree=250"
               )
## 
## Call:
##  randomForest(x = dfData[, xVar], y = dfData[, yVar, drop = TRUE],      ntree = ntree, mtry = mtry, nodesize = nodesize) 
##                Type of random forest: regression
##                      Number of trees: 250
## No. of variables tried at each split: 5
## 
##           Mean of squared residuals: 0.8016457
##                     % Var explained: 99.69

## # A tibble: 3 x 4
##   x6    baseVar rfVar  rfR2
##   <fct>   <dbl> <dbl> <dbl>
## 1 a        321. 1.06  0.997
## 2 b        216. 0.285 0.999
## 3 c        246. 0.701 0.997
# Exclude x6
helperRFRandom(rfRndTrain, yVar="y", xVar=c("x1", "x2", "x3", "x4", "x5"), 
               mtry=5, nodesize=1, ntree=250, 
               title="y ~ f(exclude x6); mtry=5, nodesize=1; ntree=250"
               )
## 
## Call:
##  randomForest(x = dfData[, xVar], y = dfData[, yVar, drop = TRUE],      ntree = ntree, mtry = mtry, nodesize = nodesize) 
##                Type of random forest: regression
##                      Number of trees: 250
## No. of variables tried at each split: 5
## 
##           Mean of squared residuals: 0.8493107
##                     % Var explained: 99.68

## # A tibble: 3 x 4
##   x6    baseVar rfVar  rfR2
##   <fct>   <dbl> <dbl> <dbl>
## 1 a        321. 1.05  0.997
## 2 b        216. 0.256 0.999
## 3 c        246. 0.675 0.997

Broadly, results are as expected:

  • Excluding x2 has the largest impact, since no other explanatory variable is correlated with x2
  • Excluding x1 has a lesser but still significant impact, since x4 is correlated with x1 but is a meaningfully noisier proxy
  • Excluding x3 has a modest impact, since it accounts for only a small share of the variance in the formula
  • Excluding x4, x5, or x6 produces materially the same result as including them, as expected since they are not needed to explain y
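
The strength of the x1/x4 relationship can also be derived analytically. Since x4 is generated as -x1 plus unit-variance normal noise, and Var(x1) = 4 for a Poisson(4) variable:

```r
# Theoretical correlation of x1 and x4 = -x1 + e, e ~ N(0, 1):
# cor(x1, x4) = -Var(x1) / sqrt(Var(x1) * (Var(x1) + 1))
v1 <- 4
corX1X4 <- -v1 / sqrt(v1 * (v1 + 1))  # -4/sqrt(20) ~ -0.894

corX1X4  # matches the -0.90 seen in the empirical correlation matrix
```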

A look at extrapolation can also be informative. Suppose the formula remains the same but all variables in rfRndTest are expanded by 40%:

rfRndExtrap <- rfRndTest %>%
    mutate(x1=round(1.4*x1), x2=round(1.4*x2), x3=1.4*x3, x4=1.4*x4, x5=1.4*x5, y=x1*x2+x3)

rfRndExtrap %>%
    mutate(pred=predict(rfRndModel, newdata=.)) %>%
    ggplot(aes(x=y, y=pred)) + 
    geom_point(alpha=0.2) + 
    geom_smooth() + 
    geom_abline(color="red", lty=2) + 
    labs(x="Actual y", y="Predicted y", title="Random forest extrapolation when all variables scaled by 40%")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

As the random forest model starts encountering x1 and x2 values that are “out of bounds” relative to the training set, the predictions become significantly worse. The random forest model has not learned a mathematical relationship that can be extrapolated; instead, it has learned the relationships among the features within the specific bounds of the training data it was given.
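This clamping can be illustrated in base R alone. The one-split regression “stump” below is a hypothetical, minimal stand-in for a single tree (it is not the randomForest internals): each leaf predicts the mean of its training responses, so no prediction can fall outside the training range of y, no matter how extreme the input.

```r
# Minimal regression stump: one split, each leaf predicts mean(training y).
# Illustrates why tree-based models cannot extrapolate: every prediction
# is an average of observed training responses.
fit_stump <- function(x, y) {
  cuts <- sort(unique(x))
  best <- list(sse = Inf)
  for (s in cuts[-length(cuts)]) {
    left <- y[x <= s]; right <- y[x > s]
    sse <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
    if (sse < best$sse) best <- list(sse = sse, split = s,
                                     left = mean(left), right = mean(right))
  }
  best
}
predict_stump <- function(m, x) ifelse(x <= m$split, m$left, m$right)

set.seed(1)
x <- runif(100, 0, 10)
y <- 2 * x                 # purely linear target; training y spans roughly [0, 20]
m <- fit_stump(x, y)
predict_stump(m, 20)       # stays within [min(y), max(y)] even though x = 20
```

A full forest averages many such leaves, which only tightens the bound; it never escapes it.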

Contrast these findings with an extrapolated linear model:

rfRndExtrap %>%
    mutate(pred=predict(lmRnd, newdata=.)) %>%
    ggplot(aes(x=y, y=pred)) + 
    geom_point(alpha=0.2) + 
    geom_smooth() + 
    geom_abline(color="red", lty=2) + 
    labs(x="Actual y", y="Predicted y", title="Linear model extrapolation when all variables scaled by 40%")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

The linear model has learned the mathematical relationship and can extrapolate outside the bounds of the training data. There is always risk in extrapolation, but the linear model is not subject to an artificial hard ceiling and floor as the random forest model is.
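The contrast can be seen in a self-contained base-R sketch (illustrative only, not the lmRnd model above): a linear model recovers the underlying coefficient, so its predictions remain sensible well beyond the range of x seen in training.

```r
# A linear model learns the coefficient (here, slope = 2), so predictions
# extrapolate beyond the training range of x.
set.seed(1)
x <- runif(100, 0, 10)             # training inputs in [0, 10]
y <- 2 * x + rnorm(100, sd = 0.1)  # true relationship: y = 2x (plus small noise)
fit <- lm(y ~ x)
predict(fit, newdata = data.frame(x = 20))  # near 40, far above any training y
```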

Further, suppose that x3 were to become a more important variable, with sd=20 rather than the baseline sd=1:

rfRndExtrap <- rfRndTest %>%
    mutate(x3=20*x3, y=x1*x2+x3)

rfRndExtrap %>%
    mutate(pred=predict(rfRndModel, newdata=.)) %>%
    ggplot(aes(x=y, y=pred)) + 
    geom_point(alpha=0.2) + 
    geom_smooth() + 
    geom_abline(color="red", lty=2) + 
    labs(x="Actual y", y="Predicted y", title="Random forest model extrapolation when x3 scaled by 20")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

rfRndExtrap %>%
    mutate(pred=predict(lmRnd, newdata=.)) %>%
    ggplot(aes(x=y, y=pred)) + 
    geom_point(alpha=0.2) + 
    geom_smooth() + 
    geom_abline(color="red", lty=2) + 
    labs(x="Actual y", y="Predicted y", title="Linear model extrapolation when x3 scaled by 20")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

The random forest becomes quite inaccurate since it never needed to learn much about x3 to make good predictions. The linear model, having learned that the coefficient for x3 is 1.00, handles a significant change in x3 while still making accurate predictions.

While the random forest model performs extremely well on the bias/variance tradeoff for a given dataset, it can become highly biased when extrapolating to input values that are not congruent with the training data bounds. This is driven in large part by the forced ceiling and floor: a random forest will never predict a category that it has not seen, nor can it predict a value above or below the range of values it has seen.
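The hard ceiling and floor follow directly from how tree ensembles predict: every leaf value is an average of observed training responses, and the forest averages leaves. A toy base-R sketch of that averaging (hypothetical, not tied to the models above) makes the bound concrete:

```r
# Any prediction formed by averaging training responses is necessarily
# bounded by the min and max of those responses.
set.seed(1)
y_train <- runif(50, 0, 100)                      # observed training responses
leaf_pred <- function() mean(sample(y_train, 5))  # toy "leaf": mean of a subset
preds <- replicate(1000, leaf_pred())
range(preds)   # always falls inside range(y_train)
```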